Abstract

Given observations of a collection of covariates and responses $(Y, X) \in \mathbb{R}^p \times \mathbb{R}^q$, sufficient
dimension reduction (SDR) techniques aim to identify a mapping $f : \mathbb{R}^q \to \mathbb{R}^k$ with $k \ll q$ such
that $Y | f(X)$ is independent of $X$. The image $f(X)$ summarizes the relevant information in a
potentially large number of covariates $X$ that influence the responses $Y$. In many contemporary
settings, the number of responses $p$ is also quite large, in addition to a large number $q$ of
covariates. This leads to the challenge of fitting a succinctly parameterized statistical model to
$Y | f(X)$, which is a problem that is usually not addressed in a traditional SDR framework. In
this paper, we present a computationally tractable convex relaxation based estimator for
simultaneously (a) identifying a linear dimension reduction $f(X)$ of the covariates that is sufficient
with respect to the responses, and (b) fitting several types of structured low-dimensional models
- factor models, graphical models, latent-variable graphical models - to the conditional
distribution of $Y | f(X)$. We analyze the consistency properties of our estimator in a high-dimensional
scaling regime. We also illustrate the performance of our approach on a newsgroup dataset and
on a dataset consisting of financial asset prices.

Keywords: $\ell_1$ norm regularization; nuclear norm regularization; Graphical Lasso;
high-dimensional inference; algebraic statistics.


1 Introduction 


Sufficient dimension reduction (SDR) is a framework for identifying a low-dimensional approximation
of a large collection of covariates that is sufficient for predicting a set of responses (Li, 1991;
Duan & Li, 1991; Cook & Weisberg, 1991). Given covariates $X \in \mathbb{R}^q$ and responses $Y \in \mathbb{R}^p$, the
objective of SDR is to obtain a mapping $f : \mathbb{R}^q \to \mathbb{R}^k$ with $k \ll q$ such that the responses $Y$ are
independent of the covariates $X$ conditioned on $f(X)$; equivalently, the conditional distribution
of $Y | f(X)$ is the same as that of $Y | X$. The image $f(X)$ is called a dimension reduction of the
covariates $X$ that is sufficient with respect to the responses $Y$, and it summarizes the relevant
information in $X$ that influences $Y$.

In many contemporary settings, the number of responses $Y$ is also quite large (in addition to
a potentially large number of covariates $X$). For example, in financial modeling applications the
responses are the prices of financial assets (numbering in the several hundreds) and the covariates
may be macroeconomic indicators (see the numerical experiments). In gene microarray
analysis, the responses correspond to the expression levels of a large number of genes (on the
order of tens of thousands), and the covariates could be a collection of physiological attributes
(Cheung & Spielman, 2002; Brem & Kruglyak, 2005). In these problem domains, the conditional
distribution of $Y | f(X)$ is specified using a large number of parameters, and as a result the
estimation of this conditional distribution leads to several statistical and computational challenges -
e.g., problems with overfitting when given a modest number of observations, and difficulties with
developing algorithmic procedures that operate within a reasonable computational budget. These
challenges in high-dimensional inference are by now well-recognized (Bühlmann & van de Geer,
2011; Wainwright, 2014), and they have been addressed in several settings based on approximations
of high-dimensional distributions consisting of many degrees of freedom by elements from
structured classes of models specified using a small number of parameters; examples of such
structured families include models described by sparse or banded covariance matrices (Bickel & Levina,
2008a,b; El Karoui, 2008), graphical models (Yuan & Lin, 2006; Friedman et al., 2008; Ravikumar
et al., 2008; Rothman et al., 2008), and factor models (Fan et al., 2008); see Wainwright (2014) for
a more extensive list of references.


1.1 Our Contributions 

Building on this prior literature, we describe a new methodology based on convex optimization that
integrates sufficient dimension reduction with techniques to fit succinctly parameterized models to
the high-dimensional conditional distribution of the responses given the covariates. In particular,
given observations of a set of responses $Y$ and covariates $X$, we fit a linear Gaussian model with
the following two properties: (a) there exists a low-dimensional linear dimension reduction¹ $f(X)$
of the covariates that is sufficient with respect to the responses, and (b) the conditional distribution
of $Y | f(X)$ is specified by a concisely parameterized statistical model - the three concrete examples
that we consider are a factor model, a graphical model, and a latent-variable graphical model.

If $(Y, X) \in \mathbb{R}^p \times \mathbb{R}^q$ is a jointly Gaussian random vector with covariance matrix
$\Sigma = \begin{pmatrix} \Sigma_Y & \Sigma_{YX} \\ \Sigma_{YX}' & \Sigma_X \end{pmatrix}$,
the existence of a dimension-$k$ linear projection of the covariates $X$ that is
sufficient with respect to the responses $Y$ is equivalent to the rank of the cross-covariance matrix
$\Sigma_{YX}$ being at most $k$ (of interest here is the setting in which $k \ll \min\{p, q\}$). In particular, the
sufficient dimension reduction $f(X)$ in such cases is given by projecting $X$ onto the $k$-dimensional
row-space of the matrix $\Sigma_{YX} \Sigma_X^{-1}$, which is the mapping that specifies the best linear estimator of $Y$
based on $X$. On the other hand, the conditional statistics of the responses given the covariates are
specified by the submatrix $[\Sigma^{-1}]_Y$ of the joint precision matrix $\Sigma^{-1}$. As a result, the problem of fitting a
concisely parameterized model to the conditional distribution of the responses given the covariates
is more conveniently specified in terms of structured approximations of a submatrix of the precision
matrix; for example, fitting a graphical model to the conditional distribution of the responses
given the covariates corresponds to approximating the submatrix $[\Sigma^{-1}]_Y$ by a sparse matrix. The
different parameterizations in which these two modeling tasks are most naturally described - SDR
in terms of covariance matrices and conditional modeling of the responses given the covariates in
terms of precision matrices - pose an obstruction to their integration into a single framework.

¹ If the covariates and responses are jointly Gaussian, it suffices to consider dimension reductions specified by linear
mappings of the covariates (Li, 1991; Chechik et al., 2005). Even in more general settings, many approaches to SDR
construct linear dimension reductions of the covariates for computational reasons (Li, 1991; Duan & Li, 1991; Cook
& Weisberg, 1991; Li, 1992; Cook & Ni, 2005).

We overcome this difficulty by making the observation that $\mathrm{rank}(\Sigma_{YX}) = \mathrm{rank}([\Sigma^{-1}]_{YX})$ in
non-degenerate models (i.e., the joint covariance matrix $\Sigma$ is positive definite) based on the following
relation between the two alternative forms - in terms of the precision matrix and in terms of the
covariance matrix - of the mapping that specifies the best linear estimate of $Y$ based on $X$:

\[
\Sigma_{YX} \Sigma_X^{-1} = -[\Sigma^{-1}]_Y^{-1} \, [\Sigma^{-1}]_{YX} \tag{1.1}
\]
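The relation (1.1) and the resulting rank equality can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic positive-definite precision matrix whose off-diagonal block has rank $k$; the dimensions `p`, `q`, `k` and the helper name `random_low_rank_precision` are illustrative choices and do not come from the paper.

```python
import numpy as np

def random_low_rank_precision(p, q, k, seed=0):
    """Build a positive-definite precision matrix whose Y-X block has rank k (illustrative)."""
    rng = np.random.default_rng(seed)
    theta_yx = 0.1 * rng.standard_normal((p, k)) @ rng.standard_normal((k, q))
    theta = np.block([[np.eye(p), theta_yx], [theta_yx.T, np.eye(q)]])
    # Shift the diagonal so that the smallest eigenvalue is at least 0.5.
    shift = max(0.0, 0.5 - np.linalg.eigvalsh(theta).min())
    return theta + shift * np.eye(p + q)

p, q, k = 6, 5, 2
theta = random_low_rank_precision(p, q, k)
sigma = np.linalg.inv(theta)

sigma_yx, sigma_x = sigma[:p, p:], sigma[p:, p:]
theta_y, theta_yx = theta[:p, :p], theta[:p, p:]

lhs = sigma_yx @ np.linalg.inv(sigma_x)       # Sigma_YX Sigma_X^{-1}
rhs = -np.linalg.inv(theta_y) @ theta_yx      # -[Sigma^{-1}]_Y^{-1} [Sigma^{-1}]_YX
identity_holds = np.allclose(lhs, rhs)

# Ranks of the cross-covariance block and of the cross-precision block.
rank_cov = np.linalg.matrix_rank(sigma_yx, tol=1e-8)
rank_prec = np.linalg.matrix_rank(theta_yx, tol=1e-8)
```

Because $\Sigma_X^{-1}$ and $[\Sigma^{-1}]_Y^{-1}$ are invertible, multiplying by them does not change the rank, which is why the two ranks agree.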


Hence, our approach for integrating SDR and conditional modeling of the responses given the
covariates is to fit a Gaussian model specified by a precision matrix
$\Theta = \begin{pmatrix} \Theta_Y & \Theta_{YX} \\ \Theta_{YX}' & \Theta_X \end{pmatrix}$
to a collection of observations of covariates and responses such that (a) the submatrix $\Theta_{YX}$ is low-rank,
and (b) the submatrix $\Theta_Y$ (which specifies the conditional distribution of the responses given the
covariates) has a concise parameterization that describes a structured model. This reformulation of
our modeling framework in terms of precision matrices leads naturally to the following general form
of an estimator, given joint observations $\{Y^{(i)}, X^{(i)}\}_{i=1}^n \subset \mathbb{R}^{p+q}$ of the responses and covariates:

\[
\operatorname*{arg\,min}_{\Theta \in \mathbb{S}^{p+q},\ \Theta \succ 0} \; -\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n) + \lambda_n \left[ \gamma \|\Theta_{YX}\|_* + R(\Theta_Y) \right] \tag{1.2}
\]


The set $\mathbb{S}^k$ denotes the space of $k \times k$ symmetric matrices, the function $\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n)$
denotes the log-likelihood of the observations $\{Y^{(i)}, X^{(i)}\}_{i=1}^n$ with respect to a Gaussian distribution
parameterized by the precision matrix $\Theta$, the function $\|\cdot\|_*$ denotes the nuclear norm (sum of
the singular values of a matrix), and the function $R : \mathbb{S}^p \to \mathbb{R}$ is suitably chosen to promote a
desired structure in the submatrix $\Theta_Y$ (i.e., the conditional distribution of the responses given the
covariates). The role of the nuclear norm penalty is to promote low-rank structure in the submatrix
$\Theta_{YX}$; this regularizer has been successfully employed in many settings for fitting structured
low-rank models to high-dimensional data (Fazel, 2002; Recht et al., 2009). Here $\lambda_n > 0$ and $\gamma > 0$
are regularization parameters. By virtue of the convexity of norms and of the negative of the
log-likelihood function $\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n)$, the program (1.2) is a convex optimization problem if the
function $R(\Theta_Y)$ is convex. In the following discussion, we provide three concrete approaches to fit
structured low-dimensional models to the conditional distribution of $Y | f(X)$ via suitable choices
of the function $R(\Theta_Y)$.
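To make the ingredients of (1.2) concrete, the sketch below evaluates the objective at a given precision matrix; it does not solve the optimization problem. It assumes NumPy, zero-mean data stacked as rows $[Y, X]$, and a log-likelihood normalized as $\log\det\Theta - \mathrm{trace}(\Theta \Sigma_n)$ with $\Sigma_n$ the sample covariance (constants and scaling are an assumption, since the paper's normalization is not shown here). The function names are hypothetical.

```python
import numpy as np

def neg_log_likelihood(theta, samples):
    """Negative Gaussian log-likelihood up to constants and scaling (assumed normalization).

    `samples` is an n x (p+q) array of zero-mean observations with rows [Y, X].
    Returns +inf when `theta` is not positive definite.
    """
    sample_cov = samples.T @ samples / samples.shape[0]
    try:
        chol = np.linalg.cholesky(theta)
    except np.linalg.LinAlgError:
        return np.inf
    logdet = 2.0 * np.log(np.diag(chol)).sum()
    return -logdet + np.trace(theta @ sample_cov)

def nuclear_norm(mat):
    """Sum of the singular values of `mat`."""
    return np.linalg.svd(mat, compute_uv=False).sum()

def objective(theta, samples, p, lam, gamma, regularizer):
    """Value of the objective in (1.2) for a user-supplied regularizer R."""
    theta_y, theta_yx = theta[:p, :p], theta[:p, p:]
    penalty = gamma * nuclear_norm(theta_yx) + regularizer(theta_y)
    return neg_log_likelihood(theta, samples) + lam * penalty

rng = np.random.default_rng(0)
p, q, n = 4, 3, 200
samples = rng.standard_normal((n, p + q))

def l1_norm(mat):
    """Entrywise l1 norm, one possible choice of R."""
    return np.abs(mat).sum()

value_at_identity = objective(np.eye(p + q), samples, p, lam=0.1, gamma=1.0, regularizer=l1_norm)
value_outside_cone = objective(-np.eye(p + q), samples, p, lam=0.1, gamma=1.0, regularizer=l1_norm)
```

Passing the regularizer as a function mirrors the role of $R$ in (1.2): the likelihood and nuclear norm terms are fixed, and only the structure imposed on $\Theta_Y$ changes.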


SDR + Factor Modeling (SDR-FM) Fitting a factor model to the conditional distribution
of $Y | f(X)$ corresponds to approximating the covariance matrix of $Y | f(X)$ as the sum of a diagonal
matrix and a low-rank matrix. By appealing to the Sherman-Morrison-Woodbury formula (Horn
& Johnson, 1990), the covariance matrix of $Y | f(X)$ being decomposable as the sum of a diagonal
matrix and a low-rank matrix is equivalent to the precision matrix of $Y | f(X)$ being decomposable
as the difference between a diagonal matrix and a low-rank matrix. Thus, a natural choice for the
regularizer $R(\Theta_Y)$ is:

\[
R(\Theta_Y) = \inf_{D_Y, L_Y \in \mathbb{S}^p} \mathrm{trace}(L_Y) \quad \text{subject to} \quad \Theta_Y = D_Y - L_Y,\ L_Y \succeq 0,\ D_Y \text{ is diagonal}. \tag{1.3}
\]

Here $D_Y, L_Y$ represent the diagonal and low-rank components of $\Theta_Y$. As before, the role of the
nuclear norm penalty (the nuclear norm for positive semidefinite matrices reduces to the trace) is
to enforce low-rank structure in the $L_Y$ component.
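The equivalence obtained from the Sherman-Morrison-Woodbury formula can be illustrated numerically. The sketch below, which assumes NumPy and uses arbitrary illustrative dimensions, builds a covariance of the form diagonal plus low-rank and verifies that its inverse is a diagonal matrix minus a positive semidefinite matrix of the same rank.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 8, 2
d = rng.uniform(1.0, 2.0, size=p)      # diagonal entries of the covariance
b = rng.standard_normal((p, r))        # factor loadings (illustrative)
cov = np.diag(d) + b @ b.T             # covariance: diagonal + rank-r

prec = np.linalg.inv(cov)
d_prec = np.diag(1.0 / d)              # diagonal component D_Y of the precision matrix
l_prec = d_prec - prec                 # candidate component L_Y, so that prec = D_Y - L_Y

rank_l = np.linalg.matrix_rank(l_prec, tol=1e-8)
min_eig_l = np.linalg.eigvalsh((l_prec + l_prec.T) / 2).min()
```

Here `rank_l` equals the number of factors `r` and `min_eig_l` is nonnegative up to rounding, which matches the constraint $L_Y \succeq 0$ in (1.3).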


SDR + Graphical Modeling (SDR-GM) Fitting a sparse graphical model to the conditional
distribution of $Y | f(X)$ corresponds to approximating the submatrix $\Theta_Y$ by a sparse matrix;
the sparsity pattern of $\Theta_Y$ specifies the graphical model structure underlying the conditional
distribution of $Y | f(X)$. Based on prior work on the Graphical Lasso (Yuan & Lin, 2006; Friedman et
al., 2008), an appropriate choice for the regularizer $R(\Theta_Y)$ is:

\[
R(\Theta_Y) = \|\Theta_Y\|_{\ell_1}. \tag{1.4}
\]

The function $\|\cdot\|_{\ell_1}$ denotes the $\ell_1$ norm (sum of the magnitudes of the entries of a matrix), and its
role is to induce sparsity in the submatrix $\Theta_Y$. Regularizers based on the $\ell_1$ norm have been widely
and successfully employed in many settings for fitting sparse models to data in high dimensions
(Tibshirani, 1996; Chen et al., 2001; Candès et al., 2006; Donoho, 2006). The convex program (1.2)
with $R(\Theta_Y) = \|\Theta_Y\|_{\ell_1}$ is a natural extension of the Graphical Lasso (Yuan & Lin, 2006; Friedman
et al., 2008), in which an $\ell_1$ norm penalty is employed to induce sparsity in the precision matrix
(although the Graphical Lasso operates purely on observations of a set of responses and does not
include a nuclear norm penalty corresponding to an SDR objective).

SDR + Latent-Variable Graphical Modeling (SDR-LVGM) We also describe a generalization
of the SDR-GM approach. In practice, it may be expensive or infeasible for a data analyst
to gather observations of all the relevant covariates that may potentially impact the responses.
As a result, the responses $Y$ could be affected by unobserved latent variables in addition to being
influenced by the covariates $X$. These latent variables can lead to confounding dependencies among
the responses, which in turn complicates the task of fitting a graphical model to the conditional
distribution of $Y | f(X)$. Motivated by these considerations, it is of interest to fit a graphical model
to the distribution of $Y | f(X), \zeta$, where $\zeta \in \mathbb{R}^h$ (here $h \ll p$) represents a small number of latent
variables that are statistically independent of $f(X)$. As described by Chandrasekaran et al. (2012),
fitting such a latent-variable graphical model corresponds to approximating the submatrix $\Theta_Y$ by
the sum of a sparse matrix and a low-rank matrix, rather than just a sparse matrix as in the case of
a pure graphical model approximation. Here the low-rank component of $\Theta_Y$ accounts for the effect
of the latent variables $\zeta$ on the responses $Y$, and the rank of this component is equal to the
dimension $h$ of $\zeta$. The sparse component of $\Theta_Y$ specifies the graphical model structure underlying
$Y | f(X), \zeta$. Building on the insights in Chandrasekaran et al. (2012), the regularizer $R_\delta(\Theta_Y)$ in this
setting is chosen as:

\[
R_\delta(\Theta_Y) = \inf_{S_Y, L_Y \in \mathbb{S}^p} \mathrm{trace}(L_Y) + \delta \|S_Y\|_{\ell_1} \quad \text{subject to} \quad \Theta_Y = S_Y - L_Y,\ L_Y \succeq 0. \tag{1.5}
\]

The matrices $S_Y, L_Y \in \mathbb{S}^p$ in (1.5) correspond to the sparse and low-rank components of $\Theta_Y$,
respectively, and $\delta > 0$ is a regularization parameter. As before, the $\ell_1$ norm and the nuclear norm
penalties promote the type of structure that we desire in our model. Plugging the regularizer
$R_\delta(\Theta_Y)$ into (1.2), we obtain the following optimization program for joint SDR and latent-variable
graphical modeling given observations $\{Y^{(i)}, X^{(i)}\}_{i=1}^n$ of the responses and the covariates:

\[
(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y) = \operatorname*{arg\,min}_{\substack{\Theta \in \mathbb{S}^{p+q},\ \Theta \succ 0 \\ S_Y, L_Y \in \mathbb{S}^p}} \; -\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n) + \lambda_n \left[ \gamma \|\Theta_{YX}\|_* + \mathrm{trace}(L_Y) + \delta \|S_Y\|_{\ell_1} \right]
\quad \text{s.t.} \quad \Theta_Y = S_Y - L_Y,\ L_Y \succeq 0. \tag{1.6}
\]

The convex program (1.6) is a natural extension of the estimator proposed by Chandrasekaran et
al. (2012), in which a trace penalty and an $\ell_1$ norm penalty are employed to fit a latent-variable
graphical model to a set of responses. To be clear, the estimator (1.6) also incorporates SDR to
identify a low-dimensional projection of a set of covariates that is sufficient for predicting the
responses, which is in contrast to the estimator proposed by Chandrasekaran et al. (2012).
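The sparse-minus-low-rank structure arises from marginalizing the latent variables, which replaces the precision matrix of the responses by a Schur complement. The sketch below is a simplified illustration that assumes NumPy and omits the covariates entirely; the tridiagonal sparsity pattern, the coupling strength, and the variable names are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
p, h = 10, 2

# Sparse (tridiagonal) precision block of the responses given the latent variables.
s_y = 2.0 * np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
k_yz = 0.1 * rng.standard_normal((p, h))   # coupling between responses and latent variables
k_z = np.eye(h)                            # precision block of the latent variables
joint_prec = np.block([[s_y, k_yz], [k_yz.T, k_z]])

# Marginal precision of the responses after integrating out the latent variables.
joint_cov = np.linalg.inv(joint_prec)
marginal_prec = np.linalg.inv(joint_cov[:p, :p])

# Schur complement: marginal precision = sparse part - low-rank positive semidefinite part.
l_y = k_yz @ np.linalg.inv(k_z) @ k_yz.T
decomposition_holds = np.allclose(marginal_prec, s_y - l_y)
rank_l_y = np.linalg.matrix_rank(l_y, tol=1e-10)
joint_is_pd = np.linalg.eigvalsh(joint_prec).min() > 0
```

The marginal precision matrix is generally dense even though `s_y` is sparse, which is why a pure $\ell_1$ penalty on $\Theta_Y$ would be a poor fit when latent variables are present.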

The regularizers defined in (1.3), (1.4), and (1.5) are convex with respect to the submatrix $\Theta_Y$.
As a result, the estimators corresponding to the SDR-FM, SDR-GM, and SDR-LVGM approaches
are convex optimization programs. The estimator (1.6) corresponding to the SDR-LVGM approach
is in some sense a generalization of the estimators corresponding to the SDR-GM and SDR-FM
approaches, as latent-variable graphical modeling may be viewed as a blend of factor modeling and
graphical modeling. As a result, we focus on analyzing the consistency properties of
the estimator (1.6) in a high-dimensional scaling regime. Specifically, suppose we observe samples
$\{Y^{(i)}, X^{(i)}\}_{i=1}^n \subset \mathbb{R}^{p+q}$ of a collection of jointly Gaussian responses and covariates $(Y, X) \in \mathbb{R}^p \times \mathbb{R}^q$
with population precision matrix
$\Theta^* = \begin{pmatrix} S_Y^* - L_Y^* & \Theta_{YX}^* \\ \Theta_{YX}^{*\prime} & \Theta_X^* \end{pmatrix} \in \mathbb{S}^{p+q}$.
Supplying these observations
as input to the convex program (1.6) and obtaining estimates $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y) \in \mathbb{S}^{p+q} \times \mathbb{S}^p \times \mathbb{S}^p$, we
prove in Theorem 2.1 that (under certain conditions on $\Theta^*$ and with high probability) the rank
of the submatrix $\hat{\Theta}_{YX}$ is equal to the rank of $\Theta_{YX}^*$, the rank of $\hat{L}_Y$ is equal to the rank of $L_Y^*$,
and the sparsity pattern of $\hat{S}_Y$ is the same as that of $S_Y^*$; these recovery guarantees imply that
we obtain the correct dimension of the image $f(X)$ specifying the sufficient dimension reduction
of the covariates, the correct number of unobserved latent variables, and the correct conditional
graphical model structure underlying the population. Informally, the assumptions on the population
precision matrix $\Theta^*$ are that: (a) the submatrix $\Theta_{YX}^*$ is sufficiently low-rank; (b) the submatrix
$\Theta_Y^* = S_Y^* - L_Y^*$ is such that $S_Y^*$ is sufficiently sparse and $L_Y^*$ is sufficiently low-rank; and (c) the
population Fisher information $(\Theta^*)^{-1} \otimes (\Theta^*)^{-1}$ obeys certain irrepresentability-type conditions; see
Assumptions 1 and 2 in Section 2.2, Theorem 2.1, and the subsequent discussion in Appendix 5.2
for a precise formulation of these conditions. The first assumption above on $\Theta^*$ states that there
exists a low-dimensional linear projection $f(X)$ of the covariates $X$ that is sufficient with respect
to $Y$. The second condition requires that $Y | f(X)$ is specified by a latent-variable graphical model
with a small number of latent variables and a sparse graphical model. The third assumption is
analogous to the irrepresentability conditions that play a role in the analysis of the consistency of
the Lasso (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Wainwright, 2014), the Graphical
Lasso (Ravikumar et al., 2008), the convex relaxation proposed by Chandrasekaran et al. (2012)
for latent-variable graphical modeling, as well as other estimators in high-dimensional inference
problems (Bühlmann & van de Geer, 2011; Wainwright, 2014).

We also illustrate the performance of the estimators corresponding to the SDR-FM,
SDR-GM, and SDR-LVGM approaches on two datasets. First, we consider a financial asset modeling
problem in which the responses are a collection of stock returns of 67 companies from the
Standard and Poor index, and the covariates are the following 7 macroeconomic indicators: the
industrial production index, the inflation rate, the amount of oil exports, the population growth
rate, the unemployment rate, the consumer credit score, and the EUR to USD exchange rate. In the
second experiment, we analyze the 20newsgroup dataset, which consists of 16,242 samples,
with each observation corresponding to a news document. The coordinates of these observations
are indexed by a collection of 100 words, and each observation is a binary vector specifying whether
a word appears in the document. Of those 100 words, the following 9 words are chosen to be the
covariates as they appear to be useful in categorizing newsgroup documents: government, religion,
science, technology, war, medicine, world, food, and games. The remaining 91 words are used as
response variables.
Adapting to Alternative Forms of SDR In many settings, one is interested in identifying
a subset of the covariates $X$ that is useful for predicting the responses $Y$, rather than a generic
dimension reduction of $X$. In such cases, it is more natural to fit a linear Gaussian model with a
precision matrix $\Theta \in \mathbb{S}^{p+q}$ in which the submatrix $\Theta_{YX}$ is column-sparse (i.e., only a subset of the
columns of this matrix are nonzero) instead of being low-rank. This point follows from the relation
(1.1) by noting that the map $-\Theta_Y^{-1} \Theta_{YX}$ specifying the best linear estimator of $Y$ based
on $X$ must be column-sparse if a subset of the covariates is sufficient for predicting the responses
(the indices of the nonzero columns of $\Theta_{YX}$ correspond to the subset of the covariates that are
relevant for predicting the responses). To fit a model with a column-sparse submatrix $\Theta_{YX}$, one
can modify the family of estimators (1.2) by replacing the nuclear norm penalty $\|\Theta_{YX}\|_*$ with a
group norm penalty $\|\Theta_{YX}\|_{2,1} = \sum_{i=1}^{q} \|(\Theta_{YX})_{:,i}\|_{\ell_2}$, where $(\Theta_{YX})_{:,i}$ represents the $i$'th column of
$\Theta_{YX}$. Such group norm penalties are useful for inducing sparsity in entire columns of $\Theta_{YX}$ (Yuan
& Lin, 2006). As an illustration, given observations $\{Y^{(i)}, X^{(i)}\}_{i=1}^n \subset \mathbb{R}^{p+q}$, the estimator (1.6) can
be modified as follows:

\[
(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y) = \operatorname*{arg\,min}_{\substack{\Theta \in \mathbb{S}^{p+q},\ \Theta \succ 0 \\ S_Y, L_Y \in \mathbb{S}^p}} \; -\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n) + \lambda_n \left[ \gamma \|\Theta_{YX}\|_{2,1} + \mathrm{trace}(L_Y) + \delta \|S_Y\|_{\ell_1} \right]
\quad \text{s.t.} \quad \Theta_Y = S_Y - L_Y,\ L_Y \succeq 0. \tag{1.7}
\]

This estimator simultaneously identifies a subset of the covariates that are relevant for predicting
the responses and also fits a latent-variable graphical model to the conditional distribution of the
responses given the covariates. The analysis of the statistical consistency of this estimator is similar
in spirit to that of the estimator (1.6); see Appendix 5.3 for more details.
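The group norm $\|\cdot\|_{2,1}$ and the covariate subset it selects are simple to compute. The following is a minimal sketch assuming NumPy; the function names and the tolerance used to decide whether a column is zero are illustrative assumptions.

```python
import numpy as np

def group_norm_2_1(mat):
    """Sum of the l2 norms of the columns of `mat`."""
    return np.linalg.norm(mat, axis=0).sum()

def nonzero_columns(mat, tol=1e-10):
    """Indices of columns whose l2 norm exceeds `tol` (the selected covariates)."""
    return np.flatnonzero(np.linalg.norm(mat, axis=0) > tol)

# A column-sparse block: only covariates 1 and 3 influence the responses.
theta_yx = np.zeros((4, 5))
theta_yx[:, 1] = [3.0, 0.0, 4.0, 0.0]   # column with l2 norm 5
theta_yx[:, 3] = [0.0, 1.0, 0.0, 0.0]   # column with l2 norm 1

penalty = group_norm_2_1(theta_yx)      # 5 + 1
selected = nonzero_columns(theta_yx)    # columns 1 and 3
```

Because the penalty sums unsquared column norms, it behaves like an $\ell_1$ norm across columns and therefore tends to zero out entire columns rather than individual entries.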


1.2 Related Work 

Many researchers have developed techniques for computing sufficient dimension reductions (see
the survey (Adragni & Cook, 2009) and the references therein), with Sliced Inverse Regression
being a prominent example (Li, 1991). In a jointly Gaussian setting, classical approaches such
as Canonical Correlations Analysis (CCA) or Partial Least Squares (PLS) may also be employed
to compute linear sufficient dimension reductions (Fung et al., 2002), although the objectives of
CCA and PLS are somewhat different from that of SDR. More recently, Negahban & Wainwright
(2011) employed a nuclear norm penalty in a multivariate linear regression setup to identify a
low-dimensional projection of a set of covariates that best predicts a set of responses. However, none
of these papers considers the additional challenge of modeling the conditional distribution of the
responses given the covariates.

A number of researchers have developed methods for simultaneously obtaining a concise model
of the predictive relationship of the covariates on the responses, while also fitting a sparse graphical
model to the conditional distribution of the responses given the covariates. For example, the
techniques introduced by Rothman et al. (2008), Cai et al. (2010), and Yin & Li (2011) may be
interpreted as seeking a sparse approximation to the matrix $\Sigma_{YX} \Sigma_X^{-1}$ (or equivalently, $-\Theta_Y^{-1} \Theta_{YX}$)
that specifies the best linear estimator of $Y$ based on $X$, in addition to computing a sparse
approximation to the submatrix $\Theta_Y$. The algorithms proposed in these papers are either non-convex
or involve multi-step procedures consisting of several convex programs. In a different direction, Sohn
& Kim (2012) and Yuan & Zhang (2014) consider a regularized log-likelihood convex program with
$\ell_1$ norm penalties on the submatrices $\Theta_Y$ and $\Theta_{YX}$ of the joint precision matrix. In contrast to these
papers, our approach for modeling the predictive relationship of the covariates on the responses is
based on SDR, where we seek a low-dimensional projection of the covariates that is sufficient for
predicting the responses. Further, our framework integrates SDR and conditional modeling of the
responses given the covariates via a single convex program.

In the same work referenced above, Yuan & Zhang (2014) also consider the problem of selecting
a subset of a collection of covariates that are most relevant for predicting a set of responses, while
additionally fitting a sparse graphical model to the conditional distribution of the responses given
the selected covariates. They address this problem by proposing the following regularized
log-likelihood convex program consisting of an $\|\cdot\|_{2,1}$ norm penalty on $\Theta_{YX}$ and an $\ell_1$ norm penalty
on $\Theta_Y$:

\[
\hat{\Theta} = \operatorname*{arg\,min}_{\Theta \in \mathbb{S}^{p+q},\ \Theta \succ 0} \; -\ell(\Theta; \{Y^{(i)}, X^{(i)}\}_{i=1}^n) + \lambda_n \left[ \gamma \|\Theta_{YX}\|_{2,1} + \|\Theta_Y\|_{\ell_1} \right] \tag{1.8}
\]

In Appendix 5.3, we analyze the high-dimensional consistency of the estimator (1.7) - which fits a
latent-variable graphical model to the conditional distribution of the responses given the selected
covariates rather than just a graphical model - and this analysis can be specialized to obtain
consistency results for the estimator (1.8). More broadly, one of the key distinctions between our
work and that of Yuan & Zhang (2014) is that our approach (1.2) for simultaneous SDR and
conditional modeling of responses given covariates is useful for obtaining general linear sufficient
dimension reductions of the covariates rather than just selecting subsets of relevant covariates.
Moreover, our framework can be adapted to fit three types of models to the conditional distribution
of the responses conditioned on the covariates.


1.3 Notation 

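Given a matrix $U \in \mathbb{R}^{p_1 \times p_2}$, the norm $\|U\|_{\ell_\infty}$ denotes the largest entry in magnitude of $U$, and the
norm $\|U\|_2$ denotes the spectral norm (the largest singular value of $U$). The norm $\|U\|_{2,\infty}$ denotes
the maximum of the $\ell_2$ norms of the columns of $U$: $\|U\|_{2,\infty} = \max_{i=1,2,\ldots,p_2} \|U_{:,i}\|_{\ell_2}$. We denote the
set of $k \times k$ positive semidefinite and positive-definite matrices by $\mathbb{S}^k_{+}$ and $\mathbb{S}^k_{++}$, respectively. Finally,
the linear operator $\mathcal{A} : \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q \to \mathbb{S}^{p+q}$ and its adjoint
$\mathcal{A}^\dagger : \mathbb{S}^{p+q} \to \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$ are defined as follows:

\[
\mathcal{A}(M, N, K, O) = \begin{pmatrix} M - N & K \\ K' & O \end{pmatrix}, \qquad
\mathcal{A}^\dagger \begin{pmatrix} Q & K \\ K' & O \end{pmatrix} = (Q, Q, K, O). \tag{1.9}
\]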
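The two maps in (1.9) amount to assembling and reading off matrix blocks. A minimal sketch, assuming NumPy; the function names `op_A` and `op_A_dagger` are hypothetical, and the second function implements the map exactly as written in (1.9).

```python
import numpy as np

def op_A(m, n, k, o):
    """A(M, N, K, O) from (1.9): assemble the (p+q) x (p+q) block matrix."""
    return np.block([[m - n, k], [k.T, o]])

def op_A_dagger(z, p):
    """The map A-dagger from (1.9): return the blocks (Q, Q, K, O) of Z."""
    q_block, k_block, o_block = z[:p, :p], z[:p, p:], z[p:, p:]
    return q_block, q_block, k_block, o_block

p, q = 3, 2
m = np.diag([2.0, 2.0, 2.0])            # plays the role of a sparse component
n = 0.1 * np.ones((p, p))               # plays the role of a low-rank component
k = np.arange(p * q, dtype=float).reshape(p, q)
o = np.eye(q)

z = op_A(m, n, k, o)
first, second, cross, last = op_A_dagger(z, p)
```

In this notation the precision matrix in (1.6) is $\Theta = \mathcal{A}(S_Y, L_Y, \Theta_{YX}, \Theta_X)$, so the operator simply packages the four unknowns into one symmetric matrix.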


2 Model Selection Consistency 

As described in the introduction, we investigate the consistency properties of the SDR-LVGM
estimator (1.6), which integrates SDR and latent-variable graphical modeling of the conditional
distribution of the responses given the covariates. The main result is stated in Section 2.3, and it is based
on assumptions on the population precision matrix (discussed in Section 2.1) and
irrepresentability-type conditions on the population Fisher information (discussed in Section 2.2).


2.1 Technical Setup 


Let $\Theta^* = \begin{pmatrix} \Theta_Y^* & \Theta_{YX}^* \\ \Theta_{YX}^{*\prime} & \Theta_X^* \end{pmatrix} \in \mathbb{S}^{p+q}_{++}$
denote the precision matrix of a jointly Gaussian random vector
$(Y, X) \in \mathbb{R}^{p+q}$ of covariates and responses, and let $\Sigma^* = (\Theta^*)^{-1}$ denote the corresponding covariance
matrix. The submatrix $\Theta_{YX}^* \in \mathbb{R}^{p \times q}$ is a rank-$k$ matrix, where $k \ll \min\{p, q\}$ is the size of
the smallest dimension reduction $f(X)$ of the covariates $X$ that is sufficient with respect to the
responses $Y$. The submatrix $\Theta_Y^* \in \mathbb{S}^p$ is the precision matrix of the conditional distribution of
$Y | f(X)$, and it specifies a latent-variable graphical model (Chandrasekaran et al., 2012). That is,
the matrix $\Theta_Y^*$ is decomposed as $\Theta_Y^* = S_Y^* - L_Y^*$ - the component $S_Y^*$ is a sparse matrix representing
the precision matrix of the distribution of $Y$ conditioned on $f(X)$ as well as a small number of
additional unobserved latent variables $\zeta \in \mathbb{R}^h$ (here $h \ll p$), and the component $L_Y^*$ is a low-rank
matrix representing the effect of the latent variables $\zeta$ ($\mathrm{rank}(L_Y^*) = h$). These structural attributes
of our model lead to the following definition:

Definition 2.1. An estimate $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y) \in \mathbb{S}^{p+q}_{++} \times \mathbb{S}^p \times \mathbb{S}^p$ with $\hat{\Theta}_Y = \hat{S}_Y - \hat{L}_Y$ is a structurally
correct estimate of the model specified by the matrices $(\Theta^*, S_Y^*, L_Y^*) \in \mathbb{S}^{p+q}_{++} \times \mathbb{S}^p \times \mathbb{S}^p$ with
$\Theta_Y^* = S_Y^* - L_Y^*$ if (1) $\mathrm{rank}(\hat{\Theta}_{YX}) = \mathrm{rank}(\Theta_{YX}^*)$ and (2) $\mathrm{sign}(\hat{S}_Y) = \mathrm{sign}(S_Y^*)$ (here $\mathrm{sign}(0) = 0$) and
$\mathrm{rank}(\hat{L}_Y) = \mathrm{rank}(L_Y^*)$.

Condition (1) ensures that the size of the smallest dimension reduction $f(X)$ of the covariates $X$
that is sufficient with respect to the responses $Y$ is estimated correctly. Condition (2) ensures that
the latent-variable graphical model specifying the conditional distribution of $Y | f(X)$ is estimated
correctly, which corresponds to accurately identifying the two components composing the precision
matrix $\Theta_Y^*$ of $Y | f(X)$. In particular, this condition ensures that (a) $\hat{S}_Y$ provides a structurally
correct estimate of the graphical model specifying the conditional distribution of $Y | f(X), \zeta$; that
is, positive, negative, and zero entries in $S_Y^*$ are estimated correctly as positive, negative, and zero
entries, respectively, in $\hat{S}_Y$, and (b) the dimension of the latent variables $\zeta$ is estimated correctly.
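In numerical experiments, the two conditions of Definition 2.1 have to be checked up to a tolerance, since computed estimates are rarely exactly zero or exactly low-rank. The sketch below assumes NumPy; the function name and the tolerance `tol` are assumptions introduced for illustration and are not part of the definition.

```python
import numpy as np

def is_structurally_correct(theta_hat_yx, s_hat, l_hat,
                            theta_star_yx, s_star, l_star, tol=1e-6):
    """Check conditions (1) and (2) of Definition 2.1, treating |value| <= tol as zero."""
    def rank(mat):
        # Singular values at or below `tol` are counted as zero.
        return np.linalg.matrix_rank(mat, tol=tol)

    def sign(mat):
        return np.sign(np.where(np.abs(mat) <= tol, 0.0, mat))

    cond1 = rank(theta_hat_yx) == rank(theta_star_yx)
    cond2 = (np.array_equal(sign(s_hat), sign(s_star))
             and rank(l_hat) == rank(l_star))
    return bool(cond1 and cond2)

# Small illustrative population model.
s_star = np.array([[2.0, -0.5, 0.0], [-0.5, 2.0, 0.0], [0.0, 0.0, 1.0]])
u = np.array([[1.0], [1.0], [0.0]])
l_star = 0.1 * u @ u.T
theta_star_yx = np.outer([1.0, 2.0, 0.0], [1.0, -1.0])

# An estimate with perturbed magnitudes but the same structure.
good = is_structurally_correct(1.1 * theta_star_yx, 0.9 * s_star, 1.2 * l_star,
                               theta_star_yx, s_star, l_star)

# An estimate with one flipped sign in the sparse component.
s_bad = s_star.copy()
s_bad[0, 1] = s_bad[1, 0] = 0.5
bad = is_structurally_correct(theta_star_yx, s_bad, l_star,
                              theta_star_yx, s_star, l_star)
```

Structural correctness is insensitive to the magnitudes of the estimated entries, so `good` is true even though no block matches the population value exactly.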


Following the literature on high-dimensional estimation (see the surveys (Bühlmann & van de
Geer, 2011; Wainwright, 2014) and the references therein), a natural set of conditions for obtaining
consistent and structurally correct parameter estimates is to assume that the curvature of the
likelihood function at $\Theta^*$ is bounded in certain directions. This curvature is governed by the Fisher
information at $\Theta^*$:

\[
\mathbb{I}(\Theta^*) = (\Theta^*)^{-1} \otimes (\Theta^*)^{-1} = \Sigma^* \otimes \Sigma^*.
\]

Here $\otimes$ denotes a tensor product between matrices and $\mathbb{I}(\Theta^*)$ may be viewed as a map from $\mathbb{S}^{p+q}$
to $\mathbb{S}^{p+q}$. We impose conditions requiring that $\mathbb{I}(\Theta^*)$ is well-behaved when applied to matrices of
the form
$\Theta - \Theta^* = \begin{pmatrix} (S_Y - S_Y^*) - (L_Y - L_Y^*) & \Theta_{YX} - \Theta_{YX}^* \\ \Theta_{YX}' - \Theta_{YX}^{*\prime} & \Theta_X - \Theta_X^* \end{pmatrix}$,
where $S_Y$ is in a neighborhood of
$S_Y^*$ restricted to the set of sparse matrices and $(L_Y, \Theta_{YX})$ are in a neighborhood of $(L_Y^*, \Theta_{YX}^*)$
restricted to sets of low-rank matrices. As formally described in Section 2.2, these local properties
of $\mathbb{I}(\Theta^*)$ around $\Theta^*$ are conveniently stated in terms of tangent spaces to the algebraic varieties of
sparse matrices and of low-rank matrices.
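Viewing the Fisher information as a map on symmetric matrices can be made concrete through the standard Kronecker product identity: applying $\Sigma^* \otimes \Sigma^*$ to a vectorized matrix $M$ gives the vectorization of $\Sigma^* M \Sigma^*$. The following sketch, assuming NumPy and an arbitrary small positive-definite matrix in place of $\Sigma^*$, verifies this.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4                                    # stands in for p + q
g = rng.standard_normal((d, d))
sigma = g @ g.T + d * np.eye(d)          # a positive-definite stand-in for Sigma*
m = rng.standard_normal((d, d))
m = (m + m.T) / 2                        # a symmetric direction M

fisher = np.kron(sigma, sigma)           # matrix representation of Sigma* (x) Sigma*
via_kron = (fisher @ m.reshape(-1)).reshape(d, d)
via_product = sigma @ m @ sigma          # the same map written as M -> Sigma* M Sigma*
```

The product form avoids building the $d^2 \times d^2$ Kronecker matrix, which matters in practice because its size grows with the fourth power of $p + q$.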

Let $M \in \mathbb{S}^p$ be a symmetric matrix with $k$ nonzero entries. The tangent space at $M$ with
respect to the algebraic variety of matrices in $\mathbb{S}^p$ with at most $k$ nonzero entries is given by:

\[
\Omega(M) = \{N \in \mathbb{S}^p \mid \mathrm{support}(N) \subseteq \mathrm{support}(M)\}.
\]

Here 'support' denotes the set of locations of the nonzero entries. Next, consider a rank-$r$ matrix
$N \in \mathbb{R}^{p_1 \times p_2}$ with reduced singular value decomposition (SVD) given by $N = U D V'$, where $U \in \mathbb{R}^{p_1 \times r}$,
$D \in \mathbb{R}^{r \times r}$, and $V \in \mathbb{R}^{p_2 \times r}$. The tangent space at $N$ with respect to the algebraic variety of
$p_1 \times p_2$ matrices with rank less than or equal to $r$ is given by²:

\[
T(N) = \{U Y_1' + Y_2 V' \mid Y_1 \in \mathbb{R}^{p_2 \times r},\ Y_2 \in \mathbb{R}^{p_1 \times r}\}.
\]

² We also consider the tangent space at a symmetric low-rank matrix with respect to the algebraic variety of
symmetric low-rank matrices. We use the same notation $T$ to denote tangent spaces in both the symmetric and
non-symmetric cases, and the appropriate tangent space is clear from the context.

In the next section, we describe irrepresentability conditions on the population Fisher information
$\mathbb{I}(\Theta^*)$ in terms of the tangent spaces $\Omega(S_Y^*)$, $T(L_Y^*)$, and $T(\Theta_{YX}^*)$; under these conditions, we
prove in Appendix 5.1 that the regularized maximum-likelihood convex program (1.6) provides
structurally correct and consistent estimates.
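Orthogonal projections onto the two tangent spaces have simple closed forms: the projection onto $\Omega(M)$ zeroes out entries outside the support of $M$, and the projection onto $T(N)$ is $Z \mapsto P_U Z + Z P_V - P_U Z P_V$, where $P_U = UU'$ and $P_V = VV'$. The following is a minimal sketch assuming NumPy; the function names and the support tolerance are illustrative assumptions.

```python
import numpy as np

def project_sparse_tangent(z, m, tol=1e-12):
    """Projection onto Omega(M): keep the entries of Z on the support of M."""
    return np.where(np.abs(m) > tol, z, 0.0)

def project_low_rank_tangent(z, n, r):
    """Projection onto T(N) for a rank-r matrix N, using its reduced SVD."""
    u, _, vt = np.linalg.svd(n, full_matrices=False)
    pu = u[:, :r] @ u[:, :r].T           # projector onto the column space of N
    pv = vt[:r].T @ vt[:r]               # projector onto the row space of N
    return pu @ z + z @ pv - pu @ z @ pv

rng = np.random.default_rng(4)
p1, p2, r = 5, 4, 2
n_mat = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))   # rank-r matrix N
z = rng.standard_normal((p1, p2))

proj_z = project_low_rank_tangent(z, n_mat, r)
proj_twice = project_low_rank_tangent(proj_z, n_mat, r)
proj_n = project_low_rank_tangent(n_mat, n_mat, r)

m_sparse = np.diag([1.0, 2.0, 3.0])
masked = project_sparse_tangent(np.ones((3, 3)), m_sparse)
```

The matrix $N$ itself lies in $T(N)$, so `proj_n` reproduces `n_mat`, and applying the projection twice gives the same result as applying it once.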


2.2 Fisher Information Conditions For Consistency 

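Let $\mathbb{I}^* = \mathbb{I}(\Theta^*)$ denote the population Fisher information at $\Theta^*$. Given a norm $\|\cdot\|_\Upsilon$ on
$\mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$, the first condition we consider is to bound the minimum gain of $\mathbb{I}^*$ restricted to a
subspace $\mathbb{H} \subseteq \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$ as follows:

\[
\chi(\mathbb{H}, \|\cdot\|_\Upsilon) = \min_{\substack{(S_Y, L_Y, \Theta_{YX}, \Theta_X) \in \mathbb{H} \\ \|(S_Y, L_Y, \Theta_{YX}, \Theta_X)\|_\Upsilon = 1}} \|\mathcal{P}_{\mathbb{H}} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_{\mathbb{H}} (S_Y, L_Y, \Theta_{YX}, \Theta_X)\|_\Upsilon, \tag{2.1}
\]

where $\mathcal{P}_{\mathbb{H}}$ denotes the projection operator onto the subspace $\mathbb{H}$ and the linear maps $\mathcal{A}$ and $\mathcal{A}^\dagger$
are defined in (1.9). The quantity $\chi(\mathbb{H}, \|\cdot\|_\Upsilon)$ being large ensures that the Fisher information $\mathbb{I}^*$
is well-conditioned restricted to the image $\mathcal{A}\mathbb{H} \subseteq \mathbb{S}^{p+q}$. The second condition that we impose on $\mathbb{I}^*$
is in the spirit of irrepresentability-type conditions (Meinshausen & Bühlmann, 2006; Zhao & Yu,
2006; Wainwright, 2009; Ravikumar et al., 2008; Chandrasekaran et al., 2012) that are frequently
employed in high-dimensional estimation. Specifically, we require that the inner-product between
elements in $\mathcal{A}\mathbb{H}$ and $\mathcal{A}\mathbb{H}^\perp$, as quantified by the metric induced by $\mathbb{I}^*$, is bounded above:

\[
\varphi(\mathbb{H}, \|\cdot\|_\Upsilon) = \max_{\|Z\|_\Upsilon = 1} \|\mathcal{P}_{\mathbb{H}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_{\mathbb{H}} (\mathcal{P}_{\mathbb{H}} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_{\mathbb{H}})^{-1}(Z)\|_\Upsilon. \tag{2.2}
\]

The operator $(\mathcal{P}_{\mathbb{H}} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_{\mathbb{H}})^{-1}$ in (2.2) is well-defined if $\chi(\mathbb{H}, \|\cdot\|_\Upsilon) > 0$, since this latter condition
implies that $\mathbb{I}^*$ is injective restricted to $\mathcal{A}\mathbb{H}$. The quantity $\varphi(\mathbb{H}, \|\cdot\|_\Upsilon)$ being small implies that any
element of $\mathcal{A}\mathbb{H}$ and any element of $\mathcal{A}\mathbb{H}^\perp$ have a small inner-product (in the metric induced by $\mathbb{I}^*$).

A natural approach to controlling the conditioning of the Fisher information around $\Theta^*$ is to
bound the quantities $\chi(\mathbb{H}^*, \|\cdot\|_\Upsilon)$ and $\varphi(\mathbb{H}^*, \|\cdot\|_\Upsilon)$ for $\mathbb{H}^* = \Omega(S_Y^*) \times T(L_Y^*) \times T(\Theta_{YX}^*) \times \mathbb{S}^q$. However,
a complication that arises with this approach is that the varieties of low-rank matrices are locally
curved around $L_Y^*$ and around $\Theta_{YX}^*$. Consequently, the tangent spaces at points in neighborhoods
around $L_Y^*$ and around $\Theta_{YX}^*$ are not the same as $T(L_Y^*)$ and $T(\Theta_{YX}^*)$. (A similar difficulty does
not arise with sparse matrices, as the variety of sparse matrices is locally flat around $S_Y^*$; hence,
the tangent spaces at all points in a neighborhood of $S_Y^*$ are the same.) In order to account for
this curvature underlying the varieties of low-rank matrices, we bound the distance between nearby
tangent spaces via the following induced norm:

\[
\rho(T_1, T_2) \triangleq \max_{\|N\|_2 \le 1} \|(\mathcal{P}_{T_1} - \mathcal{P}_{T_2})(N)\|_2.
\]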


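Using this approach for bounding nearby tangent spaces, we consider subspaces $\mathbb{H}' = \Omega(S_Y^*) \times
T_Y' \times T_{YX}' \times \mathbb{S}^q$ for all $T_Y'$ close to $T(L_Y^*)$ and for all $T_{YX}'$ close to $T(\Theta_{YX}^*)$, as measured by $\rho$
(Chandrasekaran et al., 2012). For $\omega_Y, \omega_{YX} \in (0, 1)$, we bound $\chi(\mathbb{H}', \|\cdot\|_\Upsilon)$ and $\varphi(\mathbb{H}', \|\cdot\|_\Upsilon)$ in the
sequel for all subspaces $\mathbb{H}'$ in the following set:

\[
U(\omega_Y, \omega_{YX}) = \{\Omega(S_Y^*) \times T_Y' \times T_{YX}' \times \mathbb{S}^q \mid \rho(T_Y', T(L_Y^*)) \le \omega_Y,\ \rho(T_{YX}', T(\Theta_{YX}^*)) \le \omega_{YX}\}. \tag{2.3}
\]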
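We control the quantities $\chi(\mathbb{H}', \|\cdot\|_\Upsilon)$ and $\varphi(\mathbb{H}', \|\cdot\|_\Upsilon)$ using a slight variant of the dual norm of the regularizer $\delta\|S_Y\|_{\ell_1} + \mathrm{trace}(L_Y) + \gamma\|\Theta_{YX}\|_*$ in (1.6):

$$\Phi_{\delta,\gamma}(S_Y, L_Y, \Theta_{YX}, \Theta_X) \triangleq \max\left\{ \frac{\|S_Y\|_\infty}{\delta},\ \|L_Y\|_2,\ \frac{\|\Theta_{YX}\|_2}{\gamma},\ \|\Theta_X\|_2 \right\}. \qquad (2.4)$$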
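As a concrete reading of this norm, the following minimal sketch evaluates $\Phi_{\delta,\gamma}$ for given matrices. It assumes that $\Phi_{\delta,\gamma}$ is the maximum of the entrywise maximum of $S_Y$ scaled by $1/\delta$, the spectral norm of $L_Y$, the spectral norm of $\Theta_{YX}$ scaled by $1/\gamma$, and the spectral norm of $\Theta_X$; the function name `phi_norm` and the numerical values are hypothetical.

```python
import numpy as np

def phi_norm(S_Y, L_Y, Theta_YX, Theta_X, delta, gamma):
    """Evaluate Phi_{delta,gamma} as a maximum of scaled norms (assumed form)."""
    return max(
        np.max(np.abs(S_Y)) / delta,           # entrywise max norm of S_Y, scaled by 1/delta
        np.linalg.norm(L_Y, 2),                # spectral norm of L_Y
        np.linalg.norm(Theta_YX, 2) / gamma,   # spectral norm of Theta_YX, scaled by 1/gamma
        np.linalg.norm(Theta_X, 2),            # spectral norm of Theta_X
    )

# Illustrative inputs (p = 2 responses, q = 1 covariate).
S_Y = np.array([[2.0, 0.0], [0.0, -4.0]])
L_Y = np.diag([3.0, 1.0])
Theta_YX = np.array([[5.0], [0.0]])
Theta_X = np.array([[0.5]])

value = phi_norm(S_Y, L_Y, Theta_YX, Theta_X, delta=2.0, gamma=5.0)
```

In this example the four candidate terms are 2, 3, 1, and 0.5, so the spectral norm of $L_Y$ determines the value.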


As the dual norm $\max\{\|S_Y\|_\infty/\delta,\ \|L_Y\|_2,\ \|\Theta_{YX}\|_2/\gamma\}$ of the regularizer in (1.6) plays a central role in the optimality conditions of (1.6), controlling the quantities $\chi(\mathbb{H}', \Phi_{\delta,\gamma})$ and $\varphi(\mathbb{H}', \Phi_{\delta,\gamma})$ leads to a natural set of conditions that guarantee the structural correctness and consistency of the estimates produced by (1.6). In summary, given a fixed set of parameters $(\delta, \gamma, \omega_Y, \omega_{YX}) \in \mathbb{R}_+ \times \mathbb{R}_+ \times (0,1) \times (0,1)$, we assume that $\mathbb{I}^\star$ satisfies the following conditions:


Assumption 1 


Assumption 2 


inf x(EI^ ‘^’< 57 ) > ct) for some a > 0 

M'^U{uJy ^^Yx) 

sup ‘^’ 5 , 7 ) <1 — 1 ^ for some z/ G (0,1/3). 

M'gU {ujy ) 


(2.5) 


( 2 . 6 ) 


For fixed $(\delta, \gamma, \omega_Y, \omega_{YX})$, larger values of $\alpha$ and $\nu$ in these assumptions lead to a better conditioned $\mathbb{I}^\star$.

Assumptions 1 and 2 are analogous to conditions that play an important role in the analysis of the Lasso for sparse linear regression (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Wainwright, 2009), graphical model selection via the Graphical Lasso (Ravikumar et al., 2008), and in several other approaches for high-dimensional estimation (Bühlmann & van de Geer, 2011; Wainwright, 2014). As a point of comparison with respect to analyses of the Lasso, the role of the Fisher information $\mathbb{I}^\star$ in (2.5) and in (2.6) is played by $A^T A$, where $A$ is the underlying design matrix (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Wainwright, 2009). In analyses of both the Lasso and the Graphical Lasso in the papers referenced above, the analog of the subspace $\mathbb{H}$ is the set of models with support contained inside the support of the underlying sparse population model. Assumptions 1 and 2 are also similar in spirit to conditions employed in the analysis of convex relaxation methods for latent-variable graphical model selection (Chandrasekaran et al., 2012). However, as emphasized previously in the introduction, an important distinction between the present paper and prior literature on graphical model selection is that the methods and results in previous work are not directly applicable to the problem of simultaneous SDR and (latent-variable) graphical modeling.


2.3 High-Dimensional Consistency Result 


In this section, we describe the performance of the regularized maximum-likelihood program (1.6). Before formally stating our main result, we introduce some notation. Recalling that

$$\Theta^\star = \begin{pmatrix} S_Y^\star - L_Y^\star & \Theta_{YX}^\star \\ (\Theta_{YX}^\star)^T & \Theta_X^\star \end{pmatrix}$$

is the population precision matrix, let $\tau_Y$ denote the minimum nonzero entry in magnitude of $S_Y^\star$, let $\deg(S_Y^\star)$ denote the maximal number of nonzeros per row/column of $S_Y^\star$ (i.e., the degree of the graphical model underlying the conditional distribution $Y \mid f(X), \zeta$, which is specified by the precision matrix $S_Y^\star$), let $\sigma_Y$ denote the minimum nonzero singular value of $L_Y^\star$, and let $\sigma_{YX}$ denote the minimum nonzero singular value of $\Theta_{YX}^\star$.


Theorem 2.1. Suppose we are given i.i.d. observations $\{X^{(i)}, Y^{(i)}\}_{i=1}^n \subset \mathbb{R}^{q+p}$ of a collection of jointly Gaussian covariates/responses with population precision matrix $\Theta^\star \in \mathbb{S}_{++}^{q+p}$. Fix $\alpha > 0$, $\nu \in (0,1/3)$, $\omega_Y \in (0,1)$, $\omega_{YX} \in (0,1)$. Suppose the parameters $\delta$ and $\gamma$ are chosen such that the population Fisher information $\mathbb{I}^\star$ satisfies Assumptions 1 and 2.
Let m = max{j, 1, m = max{(5, 1 , 7 }, /3 = and iIj = ||(0*)“^||2- Further, Ci = ^ + ^, 
<^2 = + 1); C'ay = j^ax{12/3 + 1, + 1}; C'tryjf = C'fV'^max{18/3, + 6/3}, 

Csamp = max{jg|;^,48/3V’^C'f,8V’C'2, and Kpper = mm2deg{s^)Csamv ' '^'^ose that the 

following conditions hold: 


1. n> 46087W(p±9); that isn> 


A2 

upper 


^m^m^deg(S'y)^ {p + g) 


A„ G ^ /3my^ 

3. ty > 2Ci6Xn; that is ry > if \n ~ 

4- cry > ^CayXn; that is ay > 


-J ''n n 

if A„~/3my^ 


^ 

~ a^LjY 

5. (Tyx > ^C^Yxl'^Xn; that is ayx > */ A„ ~ 

Then with probability greater than 1 — 2 exp{ — 4 gog^ 2 ^ 2,^2 }> the optimal solution {Q,Sy,Ly) of 
( 1 . 6 ) with the observations satisfies the following properties: 


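1. $\mathrm{sign}(\hat{S}_Y) = \mathrm{sign}(S_Y^\star)$, $\mathrm{rank}(\hat{L}_Y) = \mathrm{rank}(L_Y^\star)$, $\mathrm{rank}(\hat{\Theta}_{YX}) = \mathrm{rank}(\Theta_{YX}^\star)$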


2. ^s,'fiSy — Sy,Ly — Ly,Qyx — Qyx^ 


\\Ly - L^lls < 


Xn ~ 


c/x 

i* 


^XJ ^ 




± ' 'IL 7 


3?, IISv-x - ej-„.|b < fim, 


'G*. lie.A - 0^112 < if 


Notice that condition 1 of Theorem 2.1 ensures that the interval in condition 2 is non-empty. We give a proof of Theorem 2.1 in Appendix 5.1. Under the assumptions of the theorem, we construct appropriate primal feasible variables $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y)$ that satisfy the conclusions of the theorem - i.e., $\hat{\Theta}_{YX}, \hat{L}_Y$ are low-rank (with the same ranks as the underlying population quantities $\Theta_{YX}^\star, L_Y^\star$) and $\hat{S}_Y$ is sparse (with the same support as the underlying population quantity $S_Y^\star$) - and for which there exists a corresponding dual variable certifying optimality. This proof technique is sometimes also referred to as a primal-dual witness or certificate approach (Wainwright, 2009). The quantities $\alpha$, $\beta$ (related to $\nu$, as stated in the theorem), $\omega_Y$, $\omega_{YX}$, $\deg(S_Y^\star)$ as well as the choices of the parameters $\delta, \gamma$ play a prominent role in our result. Indeed, larger values of $\alpha$ and $\nu$ (leading to a better conditioned Fisher information, from Assumptions 1 and 2 in (2.5) and (2.6)) lead to less stringent requirements on the sample complexity, on the minimum magnitude nonzero entry $\tau_Y$ of $S_Y^\star$, on the minimum nonzero singular value $\sigma_Y$ of $L_Y^\star$, and on the minimum nonzero singular value $\sigma_{YX}$ of $\Theta_{YX}^\star$. In a similar vein, larger values of the quantities $\omega_Y$ and $\omega_{YX}$ in Assumptions 1 and 2 imply that the Fisher information is well-conditioned even for large distortions of the tangent spaces $T(L_Y^\star)$ and $T(\Theta_{YX}^\star)$ (see (2.3)), which in turn lead to less stringent requirements on the minimum nonzero singular values $\sigma_Y$ and $\sigma_{YX}$.

As is clear from Theorem 2.1, the tradeoff parameters $\delta, \gamma$ must be chosen such that the population Fisher information $\mathbb{I}^\star$ satisfies Assumptions 1 and 2 (see (2.5) and (2.6)) for some $\alpha > 0$, $\nu \in (0, 1/3)$, $\omega_Y > 0$, $\omega_{YX} > 0$. Recall that these assumptions are stated in terms of the norm $\Phi_{\delta,\gamma}$ as defined in (2.4). The key complication that arises in characterizing values of $\delta, \gamma$ for which Assumptions 1 and 2 hold is that these parameters appear as multiplicative factors on norms imposed on different sub-blocks of matrices (see the definition of $\Phi_{\delta,\gamma}$ in (2.4)), which lead to conditions on gains of the Fisher information that are coupled across the different sub-blocks. In order to overcome this difficulty, we describe in Appendix 5.2 a set of conditions on gains of the Fisher information restricted to the tangent spaces $\Omega(S_Y^\star)$, $T_Y'$, and $T_{YX}'$ separately; under these separable conditions, we explicitly characterize a non-empty polyhedral set $V(\alpha, \nu, \omega_Y, \omega_{YX}) \subset \mathbb{R}_+^2$ such that Assumptions 1 and 2 hold for all $(\delta, \gamma) \in V(\alpha, \nu, \omega_Y, \omega_{YX})$. These conditions are interpretable and are stated in terms of the degree of the graphical model structure specified by $S_Y^\star$ (this quantity, $\deg(S_Y^\star)$, makes an appearance in Theorem 2.1) and an incoherence parameter associated with the low-rank matrix $L_Y^\star$.


3 Experiments 

We illustrate the performance of the estimators corresponding to the SDR-FM, SDR-GM, and SDR-LVGM approaches (recall that these estimators are given by (1.2) with the choices (1.3), (1.4), and (1.5) for the regularizer $R(\Theta_Y)$) and the estimator (1.7) in statistical modeling tasks involving financial asset data and newsgroup data. We solve these convex programs numerically using the LogdetPPA package developed for log-determinant semidefinite programs (Toh et al., 2002). Below we discuss the details of each dataset:

financial: We consider a financial asset modeling problem in which the responses $Y$ are a collection of monthly stock returns of 67 companies from the Standard and Poor index from 1990 to 2005 and the covariates $X$ are the following 7 variables - $X_1$: EUR to USD exchange rate, $X_2$: inflation rate, $X_3$: oil exports, $X_4$: industrial production index, $X_5$: population growth rate, $X_6$: consumer price index, and $X_7$: unemployment rate. Thus, we observe $n = 188$ samples jointly of $(Y, X) \in \mathbb{R}^{67} \times \mathbb{R}^{7}$.

newsgroup: This dataset consists of $n = 16242$ observations in $\mathbb{R}^{100}$, with each observation corresponding to a news document. The coordinates of these observations are indexed by a collection of 100 words, and each observation is a binary vector specifying whether a word appears in a document. Of these 100 words, the following 9 words are designated as covariates $X \in \mathbb{R}^{9}$ as they are useful in categorizing newsgroup documents: $X_1$: government, $X_2$: religion, $X_3$: science, $X_4$: technology, $X_5$: war, $X_6$: medicine, $X_7$: world, $X_8$: food, and $X_9$: games. The remaining 91 words specify the response $Y \in \mathbb{R}^{91}$.


3.1 Sufficient Dimension Reduction and Conditional Modeling 


In this section, we investigate the performance of SDR-GM and SDR-LVGM (with the estimator (1.2) and choices (1.4) and (1.5) for the regularizer $R(\Theta_Y)$) on the financial dataset. To illustrate the utility of incorporating information about the covariates, we also contrast these methods with two modeling approaches that do not account for the impact of the covariates $X$ on the responses $Y$ - we fit a sparse graphical model (denoted GM) as well as a latent-variable graphical model (denoted LVGM) to the responses $Y$ using the Graphical Lasso technique (Yuan & Lin, 2006; Friedman et al., 2008) and the approach described by Chandrasekaran et al. (2012), respectively.

To properly compare the performance of these different techniques, we ensure that the complexity of each of the resulting models (in terms of the number of parameters required for their specification) is approximately the same.^1 We begin by choosing a regularization parameter for the Graphical Lasso method such that small changes in the value of this parameter do not lead to substantial structural changes in the estimated graphical model, i.e., the estimated model is stable with respect to small changes in the regularization parameter. Following this approach, we use the GM approach on the financial dataset (without the covariates), and the resulting graphical model consists of 909 parameters (842 edges plus 67 node parameters). Next, we choose regularization parameters for the estimators corresponding to the LVGM (without covariates), SDR-GM (with both covariates and responses), and SDR-LVGM (with both covariates and responses) approaches such that the resulting models obtained via each of these techniques consist of approximately 909 parameters. Specifically, the model obtained using LVGM consists of 10 latent variables and a conditional graphical model (conditioned on the 10 latent variables) with 221 edges for a total of 913 parameters. The model obtained using SDR-GM consists of a 4-dimensional projection of the 7 covariates (that is sufficient with respect to the responses) and a conditional graphical model over the responses (conditioned on the dimension-reduced covariates) that consists of 564 edges for a total number of 908 parameters. Finally, the model obtained using SDR-LVGM consists of a 3-dimensional projection of the covariates, 7 latent variables, and a conditional graphical model (conditioned on the dimension-reduced covariates and the 7 latent variables) with 180 edges for a total of 908 parameters. Let $\hat{\Theta}_{\mathrm{GM}} \in \mathbb{S}_{++}^{67}$, $\hat{\Theta}_{\mathrm{LVGM}} \in \mathbb{S}_{++}^{67}$, $\hat{\Theta}_{\mathrm{SDR\text{-}GM}} \in \mathbb{S}_{++}^{74}$, and $\hat{\Theta}_{\mathrm{SDR\text{-}LVGM}} \in \mathbb{S}_{++}^{74}$ denote the precision matrices of the models corresponding to GM, LVGM, SDR-GM, and SDR-LVGM respectively. Although these models have similar complexities, they have different predictive performances, as described next.
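The parameter counts quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes the standard count $r(p+q) - r^2$ for a rank-$r$ matrix of size $p \times q$, the count $rp - r(r-1)/2$ for a symmetric rank-$r$ matrix, and that the total complexity of a model is the sum of the counts of its components (an assumption that reproduces the totals reported for GM, LVGM, and SDR-LVGM). The function names are hypothetical.

```python
def sparse_gm_params(p, n_edges):
    """p node parameters plus one parameter per edge of the graphical model."""
    return p + n_edges

def low_rank_params(p, q, r):
    """Degrees of freedom of a p x q matrix of rank r."""
    return r * (p + q) - r * r

def sym_low_rank_params(p, r):
    """Degrees of freedom of a symmetric p x p matrix of rank r."""
    return r * p - r * (r - 1) // 2

p, q = 67, 7  # responses and covariates in the financial dataset

gm_total = sparse_gm_params(p, 842)
lvgm_total = sparse_gm_params(p, 221) + sym_low_rank_params(p, 10)
sdr_lvgm_total = (
    sparse_gm_params(p, 180)        # conditional graphical model
    + sym_low_rank_params(p, 7)     # 7 latent variables
    + low_rank_params(p, q, 3)      # 3-dimensional projection of the covariates
)

print(gm_total, lvgm_total, sdr_lvgm_total)  # 909 913 908
```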

We assess the predictive performance of each of these four models on 90 monthly observations $\{(Y_{\mathrm{test}}^{(i)}, X_{\mathrm{test}}^{(i)})\}_{i=1}^{90} \subset \mathbb{R}^{67} \times \mathbb{R}^{7}$ from 2006 to 2013 (recall that the training set based on which the four models were obtained consisted of 188 monthly observations during the period 1990 to 2006). For the models obtained via GM and LVGM, we compute the average log-likelihood over the test samples using the distributions specified by the precision matrices $\hat{\Theta}_{\mathrm{GM}}$ and $\hat{\Theta}_{\mathrm{LVGM}}$ respectively. For the model obtained via SDR-GM that accounts for the influence of the covariates on the responses, we compute the log-likelihood of each test sample (and subsequently average over all test samples) with respect to the distribution of $Y \mid [f(X) = f(X_{\mathrm{test}}^{(i)})]$, where $f(X)$ is the projection of $X$ onto the row-space of the matrix $-(\hat{\Theta}_{\mathrm{SDR\text{-}GM}})_Y^{-1} (\hat{\Theta}_{\mathrm{SDR\text{-}GM}})_{YX}$ (recall that this matrix denotes the map of the best linear estimator of $Y$ based on $X$). We follow a similar approach to compute the predictive performance of the model obtained via SDR-LVGM. The average predictive log-likelihoods of models obtained via GM, LVGM, SDR-GM, and SDR-LVGM are -127.55, -122.73, -121.12, and -120.28 respectively. For comparison, a model obtained via FM (factor modeling) on the training set - without incorporating the covariates - consists of 14 latent factors (for a total of 914 parameters) and gives an average predictive log-likelihood of -123.99 over the test set. On the other hand, a model obtained via SDR-FM (using the estimator (1.2) with the regularizer $R(\Theta_Y)$ set as in (1.3)) on the training set - that incorporates observations of both the covariates and the responses - consists of a 5-dimensional projection of the covariates and 8 latent factors in the conditional model of the responses given the dimension-reduced covariates (the total number of parameters equals 920), and it provides an average predictive log-likelihood of -121.35 on the test set. As larger values of average log-likelihood are indicative of a better fit to the test samples, these results suggest that SDR-LVGM offers the best predictive performance of the different approaches considered in this experiment.

^1 The number of parameters required to specify a sparse graphical model with a precision matrix $N \in \mathbb{S}^p$ is equal to $p$ plus one-half the number of nonzero off-diagonal entries of $N$ (as $N$ is symmetric). The number of parameters required to specify a $p \times q$ matrix with rank $r \le \min\{p, q\}$ is equal to $r(p + q) - r^2$. Finally, the number of parameters required to specify a matrix in $\mathbb{S}^p$ with rank $r \le p$ is equal to $rp - r(r-1)/2$.
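The conditional log-likelihood used for the SDR models can be sketched as follows. This is a minimal illustration, assuming zero-mean jointly Gaussian data, so that $Y \mid X = x$ is Gaussian with mean $-\Theta_Y^{-1}\Theta_{YX}\, x$ and precision $\Theta_Y$. Since this mean depends on $x$ only through its component in the row-space of $-\Theta_Y^{-1}\Theta_{YX}$, conditioning on $x$ and conditioning on the projection $f(x)$ give the same value. The function name `avg_conditional_loglik` and the toy inputs are hypothetical.

```python
import numpy as np

def avg_conditional_loglik(Theta_Y, Theta_YX, Y, X):
    """Average Gaussian log-likelihood of the rows of Y given the rows of X.

    Theta_Y is the p x p block and Theta_YX the p x q block of a joint
    precision matrix for zero-mean (Y, X); Y is n x p and X is n x q.
    """
    p = Theta_Y.shape[0]
    F = -np.linalg.solve(Theta_Y, Theta_YX)   # map of the best linear estimator of Y from X
    resid = Y - X @ F.T                       # residuals around the conditional mean
    _, logdet = np.linalg.slogdet(Theta_Y)    # Theta_Y is positive definite
    quad = np.einsum("ij,jk,ik->i", resid, Theta_Y, resid)
    loglik = 0.5 * logdet - 0.5 * p * np.log(2.0 * np.pi) - 0.5 * quad
    return float(np.mean(loglik))

# Toy check with p = q = 1: the conditional mean of Y given X = 1 is -0.5.
Theta_Y = np.array([[2.0]])
Theta_YX = np.array([[1.0]])
X = np.array([[1.0], [1.0]])
Y = np.array([[-0.5], [0.5]])

value = avg_conditional_loglik(Theta_Y, Theta_YX, Y, X)
```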

Focusing on the structural aspects of the models obtained via GM, LVGM, SDR-GM, and SDR-LVGM, Fig. 1 shows the (conditional) graphical model structures over the responses corresponding to each of these approaches. The ten strongest edges in the conditional graphical model obtained via SDR-LVGM in Fig. 1(d) (in terms of the magnitude of the entries in the precision matrix) are between General Electric - American Express, Target - Bancorp, Hewlett Packard - Oracle, Texas Instruments - General Electric, Occidental Petroleum Corp. - Wells Fargo, American Insurance Group - Bank of New York, Merck & Co. - Walgreens, JP Morgan - Verizon, Verizon - CVS Health, and Pfizer Inc. - Colgate. The presence of some strong edges between entities in different industries suggests that dependencies between assets belonging to the same industry may be better modeled via the latent variables or the dimension-reduced covariates. Each of these ten edges in the conditional graphical model obtained via SDR-LVGM also appears as a strong edge in the conditional graphical model obtained via LVGM (in which the covariates are not incorporated). Examples of other strong edges in the conditional graphical model obtained via LVGM (which do not appear in the conditional graphical model obtained via SDR-LVGM) include IBM Corp. - American Insurance Group, Hewlett Packard - Southern Company, Hewlett Packard - Walmart, General Electric - Colgate, and Fedex - Emerson.

Figure 1: These figures show the sparsity pattern (black denotes an edge, and white denotes no edge) of the graphical models associated with each modeling paradigm: (a) GM, (b) LVGM, (c) SDR-GM, (d) SDR-LVGM.

Turning our attention to the latent components identified in the models obtained via LVGM and SDR-LVGM, we note that the number of latent variables in the model obtained using SDR-LVGM (7 variables) is smaller than that in the model obtained via LVGM (10 variables). This observation suggests that the 3-dimensional projection of the covariates in the model obtained via SDR-LVGM accounts for some of the effect of the latent variables in the model obtained using LVGM (which does not incorporate the covariates). To illuminate this point quantitatively, let the matrix $F = -(\hat{\Theta}_{\mathrm{SDR\text{-}LVGM}})_Y^{-1} (\hat{\Theta}_{\mathrm{SDR\text{-}LVGM}})_{YX} \in \mathbb{R}^{67 \times 7}$ denote the map of the best linear estimator of $Y$ based on $X$ (specified by the model obtained via SDR-LVGM) and let the matrix $A \in \mathbb{R}^{67 \times 10}$ denote the map of the best linear estimator of $Y$ based on the latent variables $\zeta \in \mathbb{R}^{10}$ (specified by the model obtained via LVGM).^2 The column space of $F$ corresponds to the 3-dimensional image of the mapping of the best linear estimator of $Y$ based on $X$, and it represents the component of the response $Y$ that is correlated with the covariates $X$ in the model obtained using the SDR-LVGM approach. Similarly, the column space of $A$ represents the 10-dimensional component of $Y$ that is correlated with the latent variables $\zeta$ in the model obtained using the LVGM approach. As such, the closeness between these column spaces measures the degree to which the sufficient dimension reduction $f(X)$ accounts for some of the influence of the latent variables $\zeta$ on the responses $Y$. The largest principal angle between the 3-dimensional column space of $F$ and the 10-dimensional column space of $A$ is 11.06 degrees. This result indicates that the dimension-reduced covariates in the model obtained using SDR-LVGM account for some of the effect of the latent components in the model obtained via LVGM.

^2 The matrix $A$ is only known up to right-multiplication by a non-singular linear transformation. As a result, the key invariants are the rank and the column space of $A$.
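Principal angles between two column spaces can be computed from the singular values of the product of orthonormal bases. The following sketch shows one standard way to do this with NumPy; it is not necessarily the procedure used to obtain the 11.06 degree figure above, and the function name `principal_angles_deg` and the toy matrices are hypothetical. Both inputs are assumed to have full column rank.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (in degrees, ascending) between col(A) and col(B)."""
    Q_a, _ = np.linalg.qr(A)   # orthonormal basis for the column space of A
    Q_b, _ = np.linalg.qr(B)   # orthonormal basis for the column space of B
    cosines = np.linalg.svd(Q_a.T @ Q_b, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)   # guard against round-off outside [-1, 1]
    return np.degrees(np.arccos(cosines))

# Toy example: a plane and a line tilted 30 degrees out of that plane.
plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
t = np.radians(30.0)
line = np.array([[np.cos(t)], [0.0], [np.sin(t)]])

angles = principal_angles_deg(plane, line)
largest = angles[-1]   # 30 degrees for this example
```

Because the angles depend only on the column spaces, the result is unchanged if either matrix is right-multiplied by a non-singular transformation.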


3.2 Combining Covariate Selection and Graphical Modeling 


In this section, we investigate the performance of the estimator (1.7) on the financial and the newsgroup datasets. Recall that this estimator selects a subset of the covariates $X$ that is most useful for predicting the responses $Y$, while simultaneously fitting a latent-variable graphical model to the conditional distribution of the responses given selected covariates. We denote this modeling approach as CS-LVGM.

We apply the estimator (1.7) to the financial and the newsgroup datasets with fixed choices for parameters $\lambda_n, \delta$ and different choices of the parameter $\gamma$. The objective of this experiment is to illustrate the different covariates selected in each dataset as $\gamma$ is varied. For each of the models obtained using this CS-LVGM approach, Table 1 and Table 2 list the subset of covariates that were selected (recall that the selected covariates are represented by the indices of the nonzero columns of the submatrix $\hat{\Theta}_{YX} \in \mathbb{R}^{p \times q}$ of the estimated joint precision matrix $\hat{\Theta}$). As expected, larger values of $\gamma$ yield a smaller subset of relevant covariates as the regularization term $\gamma\|\Theta_{YX}\|_{1,2}$ in (1.7) is enforced more strongly. With the financial dataset, the covariates $X_1$ (EUR to USD exchange rate) and $X_5$ (population growth rate) persist as $\gamma$ increases, which suggests that they are the most relevant for predicting stock returns among the seven covariates considered in the experiment. On the other hand, the covariate $X_3$ (oil exports) appears to be the least influential. For the newsgroup dataset, the covariates $X_1$ (government) and $X_7$ (world) persist as $\gamma$ increases, suggesting that these are the most useful for predicting word occurrences in documents in the 20 newsgroup dataset, while the covariates $X_6$ (medicine) and $X_8$ (food) do not seem as relevant.
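Reading off the selected covariates from an estimate amounts to finding the nonzero columns of the $p \times q$ block $\hat{\Theta}_{YX}$. A minimal sketch follows; the numerical tolerance, the function name `selected_covariates`, and the toy matrix are assumptions made for illustration, since a numerical solver typically returns columns that are only approximately zero.

```python
import numpy as np

def selected_covariates(Theta_YX, tol=1e-8):
    """Return the 1-based indices of the columns of Theta_YX that are nonzero."""
    column_norms = np.linalg.norm(Theta_YX, axis=0)   # Euclidean norm of each column
    return [int(j) + 1 for j in np.flatnonzero(column_norms > tol)]

# Toy estimate with p = 3 responses and q = 4 covariates;
# columns 2 and 4 are numerically zero.
Theta_YX_hat = np.array([
    [0.7, 0.0, -0.2, 1e-12],
    [0.0, 0.0,  0.5, 0.0],
    [0.1, 0.0,  0.0, 0.0],
])

chosen = selected_covariates(Theta_YX_hat)
print(chosen)  # [1, 3]
```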


Table 1: $\gamma$ vs selected covariates for financial dataset with $\lambda_n = 0.58$ and $\delta = 0.29$

    $\gamma$    covariates identified
    1.50        $\{X_1, X_2, X_4, X_5, X_6, X_7\}$
    1.99        $\{X_1, X_2, X_4, X_5\}$
    2.14        $\{X_1, X_5, X_7\}$
    2.57        $\{X_1, X_5\}$

Table 2: $\gamma$ vs selected covariates for newsgroup dataset with $\lambda_n = 0.38$ and $\delta = 0.41$

    $\gamma$    covariates identified
    1.39        $\{X_1, X_2, X_3, X_4, X_5, X_7, X_9\}$
    2.32        $\{X_1, X_2, X_3, X_5, X_9\}$
    2.80        $\{X_1, X_2, X_3, X_9\}$
    3.02        $\{X_1, X_7\}$

We inspect more closely the model obtained using the CS-LVGM approach on the newsgroup dataset with parameters $\lambda_n = 0.38$, $\delta = 0.41$, $\gamma = 2.32$ (this corresponds to the second line in Table 2). This model consists of 5 selected covariates, 6 latent variables, and a conditional graphical model (conditioned on the 5 covariates and the 6 latent variables) with 10 edges for a total number of 1087 parameters. The 10 edges in this conditional graphical model are between God-Jesus, Dos-Windows, Bible-God, card-video, email-phone, Christian-God, state-university, computer-university, disk-drive, and Israel-Jews. For comparison, we obtain a model using the LVGM approach (ignoring the covariates) of comparable complexity with a total of 1083 parameters - this model consists of 11 latent variables and a conditional graphical model (conditioned on the 11 latent variables) with 46 edges. Each of the ten edges in the conditional graphical model obtained via CS-LVGM appears as a stronger edge in the conditional graphical model obtained via LVGM. Examples of additional strong edges in the conditional graphical model obtained via LVGM include patients-disease, hockey-NHL, and players-baseball. The edge between patients-disease presents an interesting illustration as these two words appear together in 64 documents. In 45 of these 64 documents, however, at least one of the 5 covariates selected by the CS-LVGM model ($X_1, X_2, X_3, X_5, X_9$ from Table 2) also appears. Thus, the lack of an edge between patients-disease in the conditional graphical model obtained using the CS-LVGM approach is perhaps explained by the inclusion of the 5 covariates. In a similar vein, the absence of the edges hockey-NHL and players-baseball in the conditional graphical model obtained via CS-LVGM may be attributed to the inclusion of the covariate $X_9$ (games). In contrast, a curious point arises when considering the presence of the edges God-Jesus, Bible-God, Christian-God, and Israel-Jews in the conditional graphical model obtained via CS-LVGM. From a preliminary inspection, one might expect that the inclusion of the covariates $X_2, X_7$ (religion, world) would account for these four pairs of interactions, resulting in their absence in the conditional graphical model. However, (as an example) the words God and Jesus appear together in 380 documents, and only 180 of these documents include at least one of the 5 selected covariates. Thus, it is not surprising that the edge God-Jesus remains in the conditional graphical model obtained using the CS-LVGM approach despite this graphical model being conditioned on the 5 selected covariates.


4 Further Directions 


The selection of regularization parameters is a common practical challenge in high-dimensional estimation problems. Approaches based on cross-validation are widely used, although these techniques optimize for prediction performance and do not always yield concise models. To address this shortcoming, several methods have been proposed recently for the selection of regularization parameters for the Lasso (Meinshausen & Bühlmann, 2010). It is of interest to extend these ideas to our estimator (1.2), which requires the specification of multiple regularization parameters. Further, the convex programs proposed in this paper are computationally tractable (i.e., solvable to a desired accuracy in polynomial time), but it would be useful to develop special-purpose numerical schemes - perhaps building on recent scalable algorithms for the Graphical Lasso (Friedman et al., 2008; Hsieh et al., 2013) - for efficient solution in massive-scale problems. Finally, it is of interest to extend our techniques to non-Gaussian settings, in which one may be interested in identifying nonlinear dimension reductions of the covariates that are sufficient with respect to the responses.
5 Appendix


5.1 Proof of Theorem 2.1


In this section, we prove the consistency results (stated in Theorem 2.1) of the estimator (1.6). The high-level proof strategy is similar in spirit to the proof of the consistency results for sparse graphical model recovery (Ravikumar et al., 2008) and latent variable graphical model recovery (Chandrasekaran et al., 2012). However, the estimator (1.6) is different from the estimators proposed by Ravikumar et al. (2008) and Chandrasekaran et al. (2012) due to the nuclear norm penalty corresponding to the SDR objective.


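We begin by considering the following convex optimization program:

$$(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y) = \arg\min_{\substack{\Theta \in \mathbb{S}^{q+p},\ \Theta \succ 0 \\ S_Y, L_Y \in \mathbb{S}^p}} \ -\ell(\Theta; \{X^{(i)}, Y^{(i)}\}_{i=1}^n) + \lambda_n \left[ \delta \|S_Y\|_{\ell_1} + \|L_Y\|_* + \gamma \|\Theta_{YX}\|_* \right] \quad \text{s.t.} \quad \Theta_Y = S_Y - L_Y. \qquad (5.1)$$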
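To make the structure of this objective concrete, the sketch below evaluates it for given matrices. It is an illustration under stated assumptions, not the solver used in the experiments: the Gaussian log-likelihood is taken as $\ell(\Theta) = \log\det\Theta - \mathrm{trace}(\Theta \Sigma_n)$ up to constants and scaling, with $\Sigma_n$ the sample covariance of the stacked vector $(Y, X)$; the $\ell_1$ norm sums the magnitudes of all entries of $S_Y$; and the function name `objective` is hypothetical.

```python
import numpy as np

def objective(Theta, S_Y, L_Y, Sigma_n, p, lam, delta, gamma):
    """Value of a (5.1)-style objective; returns inf if Theta is not positive definite."""
    if not np.allclose(Theta[:p, :p], S_Y - L_Y):
        raise ValueError("constraint Theta_Y = S_Y - L_Y is violated")
    if np.linalg.eigvalsh(Theta).min() <= 0:
        return np.inf
    _, logdet = np.linalg.slogdet(Theta)
    neg_loglik = -logdet + np.trace(Theta @ Sigma_n)   # assumed likelihood normalization
    Theta_YX = Theta[:p, p:]                           # p x q cross block
    penalty = (
        delta * np.abs(S_Y).sum()                      # l1 norm of S_Y
        + np.linalg.norm(L_Y, "nuc")                   # nuclear norm of L_Y
        + gamma * np.linalg.norm(Theta_YX, "nuc")      # nuclear norm of Theta_YX
    )
    return neg_loglik + lam * penalty

# Toy instance with p = 2 responses and q = 1 covariate.
S_Y = np.diag([2.0, 1.0])
L_Y = np.diag([1.0, 0.0])          # positive semidefinite, rank one
Theta = np.eye(3)                  # Theta_Y = S_Y - L_Y = I, Theta_YX = 0
Sigma_n = np.eye(3)

value = objective(Theta, S_Y, L_Y, Sigma_n, p=2, lam=0.5, delta=2.0, gamma=1.0)
```

For a positive semidefinite $L_Y$, the nuclear norm computed here coincides with the trace, which is why the trace penalty in (1.6) and the nuclear norm penalty above agree on such matrices.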


Comparing (5.1) with the convex program (1.6), the difference is that we no longer constrain $L_Y$ to be a positive semidefinite matrix. In particular, if $L_Y \succeq 0$, then the nuclear norm of the matrix $L_Y$ in the objective function of (5.1) reduces to the trace of $L_Y$. We show that the unique optimum $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y)$ of (5.1) has the property that with high probability, $\hat{L}_Y$ is positive semidefinite. As a result, with high probability, the variables $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y)$ are also the optimum of (1.6). In the remainder of this section, we show that under the assumptions of Theorem 2.1 the primal feasible variables $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y)$ are structurally correct estimates of $(\Theta^\star, S_Y^\star, L_Y^\star)$ (see Definition 2.1). Below, we outline our proof strategy:

1. We proceed by analyzing (5.1) with additional constraints that the variables $S_Y$, $L_Y$, and $\Theta_{YX}$ belong to the algebraic varieties of sparse and low-rank matrices (specified by the support of $S_Y^\star$ and rank of $L_Y^\star$ and $\Theta_{YX}^\star$), and that the tangent spaces $\Omega(S_Y)$, $T(L_Y)$, $T(\Theta_{YX})$ are close to the nominal tangent spaces $\Omega(S_Y^\star)$, $T(L_Y^\star)$, and $T(\Theta_{YX}^\star)$ respectively. We prove that under suitable conditions on the minimum magnitude nonzero entry of $S_Y^\star$, minimum nonzero singular value of $L_Y^\star$, and minimum nonzero singular value of $\Theta_{YX}^\star$, any optimum set of variables $(\Theta, S_Y, L_Y)$ of this non-convex program are smooth points of the underlying varieties; that is, $\mathrm{sign}(S_Y) = \mathrm{sign}(S_Y^\star)$, $\mathrm{rank}(L_Y) = \mathrm{rank}(L_Y^\star)$ and $\mathrm{rank}(\Theta_{YX}) = \mathrm{rank}(\Theta_{YX}^\star)$. Further, we show that $L_Y$ has the same inertia as $L_Y^\star$ so that $L_Y \succeq 0$.

2. Conclusions of the previous step imply that the variety constraints can be “linearized” at the global optimum of the non-convex program to obtain tangent-space constraints. Under suitable conditions on the regularization parameter $\lambda_n$, we prove that with high probability, the unique optimum of this “linearized” program coincides with the global optimum of the non-convex program.

3. Finally, we show that the tangent-space constraints of the linearized program are inactive at the optimum. Therefore, the optimal solution of (5.1) has the property that with high probability: $\mathrm{sign}(\hat{S}_Y) = \mathrm{sign}(S_Y^\star)$, $\mathrm{rank}(\hat{L}_Y) = \mathrm{rank}(L_Y^\star)$, and $\mathrm{rank}(\hat{\Theta}_{YX}) = \mathrm{rank}(\Theta_{YX}^\star)$. Since $\hat{L}_Y \succeq 0$, we conclude that the variables $(\hat{\Theta}, \hat{S}_Y, \hat{L}_Y)$ are the unique optimum of (1.6).

In Section 5.1.1 we prove the results of step 1. In Section 5.1.2 we prove the results of step 2. Finally, in Section 5.1.3 we prove the results of step 3.
5.1.1 Variety Constrained Optimization Program 

Letting m = max{j, 1, and ^ = ||(0*)~^||25 we consider the following variety-constrained opti¬ 
mization program: 


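$$(\Theta^{\mathcal{M}}, S_Y^{\mathcal{M}}, L_Y^{\mathcal{M}}) = \arg\min_{\substack{\Theta \in \mathbb{S}^{q+p},\ \Theta \succ 0 \\ S_Y, L_Y \in \mathbb{S}^p}} \ -\ell(\Theta; \{X^{(i)}, Y^{(i)}\}_{i=1}^n) + \lambda_n \left[ \delta \|S_Y\|_{\ell_1} + \|L_Y\|_* + \gamma \|\Theta_{YX}\|_* \right] \quad \text{s.t.} \quad \Theta_Y = S_Y - L_Y,\ (\Theta, S_Y, L_Y) \in \mathcal{M}. \qquad (5.2)$$

Here, the set $\mathcal{M}$ is given by: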


4 |(0, Sy, Ly) e §(^+9) xSP x§p Sy e n{S^), rank(Ly) < rank(L^) 

rank(0yx) < rank(0yy-); 

\\'Pt(L*^)^{Ly - Ly )||2 < 

ll^r(e^^)i(0Yx - 0 yx)ll 2 < ^^^2 

^s,-y[Ah*AA] < 5A,,} 


The optimization program (5.2) is non-convex due to the rank constraints $\mathrm{rank}(L_Y) \le \mathrm{rank}(L_Y^\star)$ and $\mathrm{rank}(\Theta_{YX}) \le \mathrm{rank}(\Theta_{YX}^\star)$ in the set $\mathcal{M}$. These constraints, in addition to the constraint $S_Y \in \Omega(S_Y^\star)$, ensure that the matrices $S_Y$, $L_Y$, and $\Theta_{YX}$ belong to appropriate varieties. The constraints in $\mathcal{M}$ along $T(L_Y^\star)^\perp$ and $T(\Theta_{YX}^\star)^\perp$ ensure that the tangent spaces $T(L_Y)$ and $T(\Theta_{YX})$ are “close” to $T(L_Y^\star)$ and $T(\Theta_{YX}^\star)$ respectively. Finally, the last condition roughly controls the error. In this section, we will prove that any feasible set of variables $(\Theta, S_Y, L_Y)$ - and in particular an optimal set of variables of (5.2) - is a structurally correct estimate of $(\Theta^\star, S_Y^\star, L_Y^\star)$. We begin by proving that any feasible set of variables $(\Theta, S_Y, L_Y)$ is “close” in norm to the population quantities $(\Theta^\star, S_Y^\star, L_Y^\star)$.


Proposition 5.1. Let (0,5y,Ly) be a set of feasible variables of (5.2). Let A = {Sy — Sy,Ly — 
L*y, Qyx - Q^yx, - Q^x) andCi = ^ + ^. Then, < CiA„ 

Proof Let H* = H(5^) x r(L* ) x T{e*Yx) x Then, 

^s,j[A^rAVu*{A)] < ^>5,.,[AltrAl(A)] + $5,.,[AltlMpH.x(A)] 


< 5A,i -|- mif' 


, /OJy^ 


+ 


OJYX^n 


\2m'ip^ 217111) 


2 — 


< 6A,; 


Since ^ 5 , 7 [Ph*(')] ^ 2 <h 5 ^..y(-), we have that <h 5 _..y[PH*Al^I*AlPH*(A)] < 12A„. Consequently, we 
appeal to the Fisher information Assumption 1 in (2.5) to conclude that <l> 5 ^..y[Pe*(A)] < 
Moreover: 


^>5,7[A] < ^5,..y[PH*(A)] $5,..^[P]jj*i(A)] < 


12A. 


a 




□ 


Proposition 5.1 leads to powerful implications. In particular, under additional conditions on the minimum magnitude nonzero entry of $S_Y^\star$, and minimum nonzero singular values of $L_Y^\star$ and $\Theta_{YX}^\star$, any feasible set of variables $(\Theta, S_Y, L_Y)$ of (5.2) has two key properties: (a) the variables $(\Theta_{YX}, S_Y, L_Y)$ are smooth points of the underlying varieties, and (b) the constraints in $\mathcal{M}$ along $T(L_Y^\star)^\perp$ and $T(\Theta_{YX}^\star)^\perp$ are locally inactive at $\Theta_{YX}$ and $L_Y$. These properties, among others, are proved in the following corollary.


Corollary 5.2. Consider any feasible variables {Q,Sy,Ly) of (5.2). Let ay be the smallest 
nonzero singular value of Ly, ayx be the smallest nonzero singular value of Qyx) ''"K the 
minimum magnitude nonzero element of Sy- Let H' = n(5y) x T{Ly) x T{Qyx) x o,nd Ct’ = 
Pe,±(O,L^,0^^,O). Furthermore, let Ci = ^ C 2 = ^(1 + ^), max{12/3 + 

1, ^^^^2 + 1} o.iT'd max{12/3 + • Suppose that the following inequalities 

are met: ay > ^C^^Xn, ayx > - 2<5C'iAn. Then, 

1. $L_Y$ and $\Theta_{YX}$ are smooth points of their underlying varieties, i.e. $\mathrm{rank}(L_Y) = \mathrm{rank}(L_Y^\star)$, $\mathrm{rank}(\Theta_{YX}) = \mathrm{rank}(\Theta_{YX}^\star)$. Moreover, $L_Y$ has the same inertia as $L_Y^\star$.


S- W-PTfL-Xar - i;')l |2 < and ||PT(e^^)401-X - e;.A-)ll 2 < ^ 

3. $\rho(T(L_Y), T(L_Y^\star)) \le \omega_Y$, and $\rho(T(\Theta_{YX}), T(\Theta_{YX}^\star)) \le \omega_{YX}$; that is, the tangent spaces at $L_Y$ and $\Theta_{YX}$ are “close” to the tangent spaces at $L_Y^\star$ and $\Theta_{YX}^\star$ respectively.

4. ^ 

5. ^s^^[Cti] < C2Xn 

6. $S_Y$ is a smooth point of its underlying variety, i.e. $\mathrm{sign}(S_Y) = \mathrm{sign}(S_Y^\star)$


Proof. We note the following relations before proving each step: Ci > ^ ajyjOOyx £ [0,1], 


and fd = > g for n G (0,1/3]. We also appeal to the results of Kato (1995), Bach (2008), and Chandrasekaran et al. (2012) regarding perturbation analysis of the low-rank matrix variety.


1. Based on the assumptions regarding the minimum nonzero singular values of Ly and 0yx; 
we have: 

C^X C X 

ay > ^^7711/^(12/3 + 1) > ^^(12/3 + 1) > (12/3 +l)CiA„ > 8 CiAn > 8 ||L-L ^||2 

(jJy OJy 

aryx > + 12/3) > ClX.^py^mif^— > 87 C'iA„ > 8||0yx - 0 yxll 2 

Uyx V 7 / 7 


Combining these results with Proposition 5.1, we conclude that $L_Y$ and $\Theta_{YX}$ are smooth points of
their respective varieties, i.e. $\mathrm{rank}(L_Y) = \mathrm{rank}(L_Y^\star)$ and $\mathrm{rank}(\Theta_{YX}) = \mathrm{rank}(\Theta_{YX}^\star)$. Furthermore,
$L_Y$ has the same inertia as $L_Y^\star$.

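2. Since $\sigma_Y \geq 8\|L_Y - L_Y^\star\|_2$ and $\sigma_{YX} \geq 8\|\Theta_{YX} - \Theta_{YX}^\star\|_2$, we can appeal to Proposition 2.2 in
(Chandrasekaran et al., 2012) to conclude that the constraints in $\mathcal{M}$ along $T(L_Y)^\perp$ and
$T(\Theta_{YX})^\perp$ are strictly feasible: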


\{PT{LtYp{d^Y — Ly)\\2 < 

II^T(e^^)-L(0FX — 0yY)ll2 < 


||Ly Lylll ^ _ CfX'^UJY _ ^ Xn 

cry C^A„mi/2(i2/3 + 1) “ 

l|QyX ~ Qyxili _ X'^UJyX ^ An 

C2Anmi/2^2^f + 12^) “ 48mi/2 
3. Appealing to Proposition 2.1 in (Chandrasekaran et al., 2012), we prove that the tangent
spaces $T(L_Y)$ and $T(\Theta_{YX})$ are close to $T(L_Y^\star)$ and $T(\Theta_{YX}^\star)$ respectively:


. An/r*\\ 2II Ay — II 2 2CiXnU)Y 

,iT(Ly),nm < ,, 


cryx 


C'2A„mV’272f 


< Wyx 


4. Letting $\sigma'_Y$ and $\sigma'_{YX}$ be the minimum nonzero singular values of $L_Y$ and $\Theta_{YX}$ respectively,
we note that:


n'i \ 

J Hr r* II ^ 

CTy > cry — ||Ay — Ay||2 ^ 


> 


OJy 

ClK 

OJy 

CfXn-f^ 


milP‘{12(3 + 1) — CiAr 


I2m'ip'^l3 > SCiXn > 8||Ly - L\ 


'y||2 


^YX — rryx ~ ||0yx ~ ©yxlb ^ -h 12/?') — CiXn'y 

OJYX ^ 1 ^ 


> 


OJYX 

- CnA„ > 8C,A,.^ > sue,.,, - 

OJYXl 


Once again appealing to Proposition 2.2 in (Chandrasekaran et al., 2012), we have:

^5,-t{CT') < r?r||Py(iy)±(Ly — Ly)||2 + m||Py(0y^)±(0yx 

||Ay — Ay III ||0yx — Oyxili 


- 0 yx)ll 2 


< m ——+ 


cr 


y 


a 


YX 


< 


< 


mCfX: 


a \2 




C^A„mV>2(i2/3+l) 


bJY 


_I_ / 

— CiAn CjA„ 72 mV >2 ^^ + 12/3^ 


XnOJY , XnOJYX ^ Xn 


ujyx 


- ClXnJ 


12/3'0^ 12^^^ 

This leads to the result that 4>5^^[Al^I*AlC'y'] < 

5. Following the same reasoning as step 4, we conclude: 

\ ^ ll-^y^-bylli , I|0y^“0yxll2 

^5,7(yT') < m -^- \-m -^- 

(Tty (7 YX 


mCfX 


i2 \2 

'n 




+ 


-■— 

C2^Y , , C2i^YX , ^ r' \ 

— o ' o 




- CiAnT 


6. This fact follows immediately since $\|S_Y - S_Y^\star\|_{\ell_\infty} \leq \delta C_1 \lambda_n$ and the smallest nonzero magnitude
entry of $S_Y^\star$ is greater than $2\delta C_1 \lambda_n$. □
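
The sign-recovery argument in step 6 is an entrywise triangle-inequality bound. As a sketch only, write $\epsilon$ for any upper bound on $\|S_Y - S_Y^\star\|_{\ell_\infty}$ (the symbol $\epsilon$ is introduced here purely for illustration), and assume that every nonzero entry of $S_Y^\star$ exceeds $2\epsilon$ in magnitude and that $S_Y$ is supported inside the support of $S_Y^\star$:

```latex
% Entries (i,j) inside the support of S_Y^\star:
\left| [S_Y]_{ij} - [S_Y^\star]_{ij} \right| \le \epsilon < \tfrac{1}{2} \left| [S_Y^\star]_{ij} \right|
\;\Longrightarrow\;
\mathrm{sign}\left( [S_Y]_{ij} \right) = \mathrm{sign}\left( [S_Y^\star]_{ij} \right).
% Entries (i,j) outside the support: [S_Y]_{ij} = 0 = [S_Y^\star]_{ij}.
```

Under these assumptions a perturbation of size at most $\epsilon$ can neither flip the sign of a nonzero entry nor move it to zero, which is the content of the sign-pattern claim.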
5.1.2 From Variety Constraints to Tangent-Space Constraints


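Consider any optimal solution $(\hat\Theta^{\mathcal{M}}, \hat S_Y^{\mathcal{M}}, \hat L_Y^{\mathcal{M}})$ of (5.2). In Corollary 5.2 we concluded that the
variables $(\hat\Theta^{\mathcal{M}}, \hat S_Y^{\mathcal{M}}, \hat L_Y^{\mathcal{M}})$ are smooth points of their respective varieties. As a result, the rank
constraints $\mathrm{rank}(L_Y) \leq \mathrm{rank}(L_Y^\star)$ and $\mathrm{rank}(\Theta_{YX}) \leq \mathrm{rank}(\Theta_{YX}^\star)$ can be “linearized” to $L_Y \in T(\hat L_Y^{\mathcal{M}})$
and $\Theta_{YX} \in T(\hat\Theta_{YX}^{\mathcal{M}})$ respectively. Since all the remaining constraints are convex, the optimum
of this linearized program is also the optimum of (5.2). Moreover, we once more appeal to
Corollary 5.2 to conclude that the constraints in $\mathcal{M}$ along $T(\hat L_Y^{\mathcal{M}})^\perp$ and $T(\hat\Theta_{YX}^{\mathcal{M}})^\perp$ are strictly
feasible at $(\hat\Theta^{\mathcal{M}}, \hat S_Y^{\mathcal{M}}, \hat L_Y^{\mathcal{M}})$. As a result, these constraints are locally inactive and can be
removed without changing the optimum. Finally, we claim that the constraint $\Phi_{\delta,\gamma}[\mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A}(\Delta)] \leq 5\lambda_n$
in (5.2) can also be removed in this “linearized” convex program. In particular, letting
$\mathbb{H}_{\mathcal{M}} = \Omega(S_Y^\star) \times T(\hat L_Y^{\mathcal{M}}) \times T(\hat\Theta_{YX}^{\mathcal{M}}) \times \mathbb{S}^q$ (note that $\mathbb{H}_{\mathcal{M}} \in U(\omega_Y, \omega_{YX})$, where the set $U(\omega_Y, \omega_{YX})$ is defined in
(2.3)), consider the following convex optimization program with the constraint $\Phi_{\delta,\gamma}[\mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A}(\Delta)] \leq 5\lambda_n$
removed:

$$(\hat\Theta, \hat S_Y, \hat L_Y) = \arg\min_{\Theta \in \mathbb{S}^{q+p},\ \Theta \succ 0,\ S_Y, L_Y} \; -\ell(\Theta; \{X^{(i)}, Y^{(i)}\}_{i=1}^n) + \lambda_n \left[ \delta \|S_Y\|_{\ell_1} + \|L_Y\|_{*} + \gamma \|\Theta_{YX}\|_{*} \right]$$

$$\text{s.t.} \quad \Theta_Y = S_Y - L_Y, \quad (S_Y, L_Y, \Theta_{YX}, \Theta_X) \in \mathbb{H}_{\mathcal{M}} \qquad (5.3)$$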


In the following proposition, we prove that under additional conditions on the regularization parameter
$\lambda_n$ of (5.3), the optimal solution of (5.2) considered above is the unique optimum of (5.3).

Proposition 5.3. Let C' = {2 + (5deg(S^) 7 )^’, Ci = ^ 0*2 = ^(1 + and C'^amp = 

max 4 C 2 C", 32mV>C C 2 ^ . Suppose that the number of observed samples obeys 

n > 4608/3“^Aq), and the regularization parameter An is chosen in the following range: 

Then, with probability greater than 1 — 2exp| 


A- 


6/3 


128(p+g)m2^2 


’ C 

^ samp _ 


nXl 


4608/3^m^i/i^ 


(e,Sv,Zy) = (e-«.s7,L7). 


}. 


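Proof. The high-level proof strategy is to show that the constraint $\Phi_{\delta,\gamma}[\mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A}(\cdot)] \leq 5\lambda_n$ is inactive
at the optimum of (5.3). That is, we show that $\Phi_{\delta,\gamma}[\mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A}(\hat S_Y - S_Y^\star, \hat L_Y - L_Y^\star, \hat\Theta_{YX} - \Theta_{YX}^\star, \hat\Theta_X - \Theta_X^\star)] < 5\lambda_n$.
The proof of this fact relies on the results of the following lemmas: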

Lemma 5.4. Let A = {Sy — Sy, Ly — Ly, Qyx — ®yx’ “ ®x)- Denote 

Rs.{A) A //^ 5 ,^[A] < 2 ^, then: $ 5 ,.,[Mti?s*(A)] < 2mf;C>^4>s,.y[A]‘^. 

Proof. We begin by introducing a quantity that plays an important role in the proof of this lemma
and was also employed in Chandrasekaran et al. (2012). Given a symmetric $p \times p$ matrix $M$, we
define the quantity $\mu(\Omega(M))$ with respect to the tangent space $\Omega(M)$ of the variety of $p \times p$ matrices
with at most $|\mathrm{support}(M)|$ nonzero entries:

$$\mu(\Omega(M)) \triangleq \max_{N \in \Omega(M),\ \|N\|_{\ell_\infty} = 1} \|N\|_2 \qquad (5.4)$$


One can show that a sparse matrix $M$ with “bounded degree” (a small number of nonzeros
per row/column) has small $\mu(\Omega(M))$. Specifically, for any $p \times p$ matrix $M$, we have $\mu(\Omega(M)) \leq
\deg(M)$, where $\deg(M)$ is equal to the maximum number of nonzeros in any column/row of $M$
(Chandrasekaran et al., 2012). We now proceed with the proof:


OO 


Al 


< 


m'0 


k=2 


k=2 


+ 7 


||A 0 yx|k 

7 


+ ||A0x||2 


< 


+ 7 


milj y^^'=(5deg(5^) 

k=2 

||A0yx||2 


l|A5y||, 


+ IIAl- 


Y 2 


7 


+ ||A0x||2 


||A5y||, 


+ IIal 


Y\\2 


^ (2 + Weg(5}) + ^ 2mi,C"-S,,,[^f 

^ 1 - (2 + (5deg(5^) + 7 )$ 5 ,^[A]V' 

Note that the second inequality employs the property that $(\hat S_Y - S_Y^\star) \in \Omega(S_Y^\star)$ and the quantity
$\mu(\Omega(S_Y^\star))$ defined in (5.4). □


Lemma 5.5. Let Ctj^ = (0, Ly, 0yj(-, 0). Furthermore, let En = 'Fn — S*, and A = (5y — 

Sy, Ly — Ly, Qyx — ©yx’ “ ©x)' define: 

r = max ^^(^^s,y[-^^En +Ah* ACtm] +\n^, ^s,h^TM\'\ (5A) 


Ifr< min{^, t^^n 4>5,7[A] < 2r. 

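Proof. The proof of this result uses Brouwer's fixed-point theorem, and is inspired by the proof of
a similar result in (Ravikumar et al., 2008; Chandrasekaran et al., 2012). The optimality conditions
of (5.3) suggest that there exist Lagrange multipliers $Q_{\Omega} \in \Omega(S_Y^\star)^\perp$, $Q_{T_Y} \in T(\hat L_Y^{\mathcal{M}})^\perp$, and $Q_{T_{YX}} \in
T(\hat\Theta_{YX}^{\mathcal{M}})^\perp$ such that

$$[\Sigma_n - \hat\Theta^{-1}]_Y + Q_{\Omega} \in -\lambda_n \delta\, \partial\|\hat S_Y\|_{\ell_1}; \qquad [\Sigma_n - \hat\Theta^{-1}]_Y + Q_{T_Y} \in \lambda_n\, \partial\|\hat L_Y\|_{*}$$

$$[\Sigma_n - \hat\Theta^{-1}]_{YX} + Q_{T_{YX}} \in -\lambda_n \gamma\, \partial\|\hat\Theta_{YX}\|_{*}; \qquad [\Sigma_n - \hat\Theta^{-1}]_X = 0$$

Letting the SVDs of $\hat L_Y$ and $\hat\Theta_{YX}$ be given by $\hat L_Y = U D V'$ and $\hat\Theta_{YX} = \tilde U \tilde D \tilde V'$
respectively, we can restrict the optimality conditions to the space $\mathbb{H}_{\mathcal{M}}$ to obtain:

$$P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger (\Sigma_n - \hat\Theta^{-1}) = (-\lambda_n \delta\, \mathrm{sign}(\hat S_Y),\ \lambda_n U V',\ -\lambda_n \gamma\, \tilde U \tilde V',\ 0) \qquad (5.6)$$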


Based on the Fisher information Assumption (2.5), the optimum of (5.3) is unique (this is because
the Hessian of the negative log-likelihood term is positive definite restricted to the tangent space
constraints). Moreover, using standard Lagrangian duality, one can show that the set of variables
$(\hat\Theta, \hat S_Y, \hat L_Y)$ that satisfy (5.6) is unique. Expanding the matrix inverse around $\Theta^\star$ allows one to express $\hat\Theta^{-1}$
equivalently by:

$$\hat\Theta^{-1} = [\Theta^\star + \mathcal{A}(\Delta)]^{-1} = \Sigma^\star + R_{\Sigma^\star}(\mathcal{A}(\Delta)) - \mathbb{I}^\star \mathcal{A}(\Delta)$$
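
The expansion used at this step is the Neumann series of a perturbed matrix inverse: $(\Theta^\star + D)^{-1} = \Sigma^\star - \Sigma^\star D \Sigma^\star + \Sigma^\star \sum_{k \geq 2} (-D \Sigma^\star)^k$, where the middle term is the Fisher-information term for a Gaussian model and the series is the remainder. The following is a minimal numerical sketch of that identity on a $2 \times 2$ example; the matrices `theta_star` and `delta`, and all helper names, are illustrative assumptions and not quantities from the paper.

```python
def matmul(a, b):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b, scale=1.0):
    """Entrywise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(m):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def remainder(sigma, delta, terms=60):
    """Truncated remainder sigma * sum_{k >= 2} (-delta sigma)^k."""
    step = [[-x for x in row] for row in matmul(delta, sigma)]
    power = matmul(step, step)          # the k = 2 term
    total = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(terms):
        total = add(total, power)
        power = matmul(power, step)
    return matmul(sigma, total)

# Hypothetical population precision matrix and a small symmetric perturbation.
theta_star = [[2.0, 0.5], [0.5, 1.0]]
delta = [[0.1, -0.05], [-0.05, 0.2]]

sigma_star = inv2(theta_star)
exact = inv2(add(theta_star, delta))
first_order = matmul(matmul(sigma_star, delta), sigma_star)
expansion = add(add(sigma_star, first_order, -1.0), remainder(sigma_star, delta))

# Largest entrywise gap between the exact inverse and the series expansion.
max_err = max(abs(x - y) for re, rx in zip(exact, expansion) for x, y in zip(re, rx))
```

The series converges only when the perturbation is small relative to $\Theta^\star$, which is the role played by the smallness condition on $\Phi_{\delta,\gamma}[\Delta]$ in Lemma 5.4.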


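Setting $Z = (-\lambda_n \delta\, \mathrm{sign}(\hat S_Y),\ \lambda_n U V',\ -\lambda_n \gamma\, \tilde U \tilde V',\ 0)$, relation (5.6) can be restated as:

$$P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger \left( E_n - R_{\Sigma^\star}(\mathcal{A}(\Delta)) + \mathbb{I}^\star \mathcal{A}(\Delta) \right) = Z \qquad (5.7)$$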
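Notice that $\Phi_{\delta,\gamma}(Z) = \lambda_n$. We now appeal to Brouwer's fixed-point theorem to bound
$\Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}}(\Delta)]$. Consider the following function $G(\delta)$ with $\delta \in \mathbb{H}_{\mathcal{M}}$:

$$G(\delta) = \delta - \left( P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A} P_{\mathbb{H}_{\mathcal{M}}} \right)^{-1} \left( P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger \left[ E_n - R_{\Sigma^\star} \mathcal{A}(\delta + C_{T_{\mathcal{M}}}) + \mathbb{I}^\star \mathcal{A}(\delta + C_{T_{\mathcal{M}}}) \right] - Z \right)$$

Note that the function $G(\delta)$ is well-defined since the operator $P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A} P_{\mathbb{H}_{\mathcal{M}}}$ is bijective due to
Fisher information Assumption 1 in (2.5). As a result, $\delta$ is a fixed point of $G(\delta)$ if and only if
$P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger [E_n - R_{\Sigma^\star} \mathcal{A}(\delta + C_{T_{\mathcal{M}}}) + \mathbb{I}^\star \mathcal{A}(\delta + C_{T_{\mathcal{M}}})] = Z$. Since the variables $(\hat\Theta, \hat S_Y, \hat L_Y)$ are the unique
solution to (5.3), the only fixed point of $G$ is $P_{\mathbb{H}_{\mathcal{M}}}(\Delta)$. Next we show that this unique optimum
lives inside the ball $B_r = \{\delta \mid \Phi_{\delta,\gamma}(\delta) \leq r,\ \delta \in \mathbb{H}_{\mathcal{M}}\}$. In particular, if we show that under the
map $G$, the image of $B_r$ lies in $B_r$, we can appeal to Brouwer's fixed-point theorem to conclude
that $P_{\mathbb{H}_{\mathcal{M}}}(\Delta) \in B_r$. For $\delta \in B_r$, $\Phi_{\delta,\gamma}[G(\delta)]$ can be bounded as follows: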


^ 5 , 7 [G( 5 )] = [5 - {Vmj^Ah*AVnj^)-^{rmj^A^[En - Ry^A{S + Gt^) 

+ rA{6 + CT^)]-z] 


= *^(5,7 ^ [En - Rs* A{6 + Gtj^) 




< -^>5,7 

a 
2 

< - 
a L 
r 


Rmj^A\En - Ry*A{6 + Ct^) + rA(Cr^)) - Z 


^&,^[A\En + rA{CT^))] + \n\ + 


- 2 


a 


The hrst inequality holds because of Fisher information Assumption 1 in (2.5). The second inequal¬ 
ity uses the property ‘h5^..y[7^e^ (•)] ^ 2<h5^.y(.) and ^s^.y{Z) = A„. Moreover, since r < we have 
‘^( 5 , 7 + Etj^) < <1*5,7(^) E ^<5,7(^Tm) ^ 2r < We can now appeal to Proposition 1 to obtain: 

^<h5,7[Al^^s*(<i + Cr^)] < ^mV*C'2[<I>5,7(5 + CT^)]2 


< 


16mV’„,2 2 

- C r = 


a 


16mV*„,2 ■ 

-C r 


a 


r 

r < - 
- 2 


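Thus, we conclude that $\Phi_{\delta,\gamma}[G(\delta)] \leq r$ and, by Brouwer's fixed-point theorem, $\Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}}(\Delta)] \leq
r$. Furthermore,

$$\Phi_{\delta,\gamma}[\Delta] \leq \Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}}(\Delta)] + \Phi_{\delta,\gamma}(C_{T_{\mathcal{M}}}) \leq 2r$$

□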


Lemma 5.6. Suppose that the number of observed samples obeys n > 4608/3^m^V’^C'samp(P+^)j 


the regularization parameter Xn is chosen in the following range: A„ G 


Then, with probability greater than 1 — 2exp| 


4608/3^ 


}, 4*5,7 


nj _A aa • 


6^‘ 


Proof. First, note that <h 5 ,..y[Al^£'„] < m||S„ — T*|| 2 - Using the results in (Davidson &: Szarek 


2001) and the fact that ^ < 8ip and n > ^ following bound holds: Pr[m||S„ — 

2exp| 1■ Thus, ^s^jiA^En] < ^ with probability greater than 1 - 2exp| - 


s*ii2 > i^] < 

n\l 




}■ 


□ 
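Proof of Proposition 5.3: We now proceed with completing the proof of this proposition. In particular,
we show that $\Phi_{\delta,\gamma}[\mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A}(\hat S_Y - S_Y^\star, \hat L_Y - L_Y^\star, \hat\Theta_{YX} - \Theta_{YX}^\star, \hat\Theta_X - \Theta_X^\star)] < 5\lambda_n$. Based on the
optimality condition (5.7) and the property that $\Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}}(\cdot)] \leq 2\Phi_{\delta,\gamma}(\cdot)$, we have: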


^> 5,7 [Pum A^I^AVwm ('^)] 


< 2Xn + 2<I>5^-y[Al'^i?2*(A)] + ACtj^] 

+ ‘ 2 :^&,^[A^ En\ 


(5.8) 


Appealing to Corollary 5.2 and Proposition 5.6, we have that ACy^ ] < ^, ^>5,7[Ct'] < 

C 2 Xn and (with high probability) En] < Consequently, based on the bound on A^ i 


m 


assumption of Theorem 5.3 it is straightforward to show that r < min{^, Hence by 

Proposition 5.5, <I> 5 ^.y[A] < Finally, we can appeal to Proposition 5.4 to obtain: 


^5,7[-4'fi2s*(A)] < 2mV’C'2$5,.^[A]2 < 2rmljC'‘^ClXl < \l2^miljC'^ClX. 


An ^ An 
- GP ~ 6(3 


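where the second to last bound comes from the bound on $\lambda_n$. Thus, the expression in (5.8) can be
further simplified to:

$$\Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger \mathbb{I}^\star \mathcal{A} P_{\mathbb{H}_{\mathcal{M}}}(\Delta)] \leq 2\lambda_n + \frac{\lambda_n}{\beta} \leq \frac{17\lambda_n}{8}$$

The last bound follows since $\beta \geq 8$. Furthermore,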


4>,,.,[AltlM(A)] < cD,^^[Pjj^^tr^PH^(A)] + ch5,.,[Pe^AltlMPH^(A)] 
+ (A)] 

+ ^S,-y[Ah^ACTj^, 

< 


17 An 17An/.. ^ An 17 An 17 An An 


Note that we appeal to Fisher information Assumption 2 in (2.6) in the second inequality. □


5.1.3 From Tangent Constraints to the Original Problem


Finally, we show that the tangent-space constraints in (5.3) can be removed without altering the
optimum value. More concretely:

Proposition 5.7. Suppose that the number of observed samples obeys n > 4608/3^m^V’^Csamp(P + 
q), and the regularization parameter An is chosen in the following range: 


An £ 


6/g / 128(p+g)m^V>^ ^ 1 

^ ^samp 


. Then, with probability greater than 1 — 


optimum of (5.3) is the same as the optimum of (5.1). 


2 exp I 


46O8/3^m^'0- 




the 
Proof. Note that the bound on $n$ implies that the assumptions of Proposition 5.3 are satisfied.
Hence, we conclude that the solution of the tangent-space constrained program (5.3) is the same
as the global optimum of the variety constrained program (5.2). Next, we show that the optimal
variables of (5.3) remain unchanged once the tangent-space constraints are removed. We proceed
by proving that the optimum set of variables $(\hat\Theta, \hat S_Y, \hat L_Y)$ satisfies the optimality conditions of (5.1)
given by:

$$[\Sigma_n - \hat\Theta^{-1}]_Y \in -\lambda_n \delta\, \partial\|\hat S_Y\|_{\ell_1}, \qquad [\Sigma_n - \hat\Theta^{-1}]_Y \in \lambda_n\, \partial\|\hat L_Y\|_{*}$$

$$[\Sigma_n - \hat\Theta^{-1}]_{YX} \in -\lambda_n \gamma\, \partial\|\hat\Theta_{YX}\|_{*}, \qquad [\Sigma_n - \hat\Theta^{-1}]_X = 0$$

Equivalently, we show that $(\hat\Theta, \hat S_Y, \hat L_Y)$ satisfies the following set of conditions:

1. $P_{\mathbb{H}_{\mathcal{M}}} \mathcal{A}^\dagger (\Sigma_n - \hat\Theta^{-1}) = (-\lambda_n \delta\, \mathrm{sign}(\hat S_Y),\ \lambda_n U V',\ -\lambda_n \gamma\, \tilde U \tilde V',\ 0)$

2. $\Phi_{\delta,\gamma}[P_{\mathbb{H}_{\mathcal{M}}^\perp} \mathcal{A}^\dagger (\Sigma_n - \hat\Theta^{-1})] < \lambda_n$


Here, $U D V'$ is the SVD of $\hat L_Y$ and $\tilde U \tilde D \tilde V'$ is the SVD of $\hat\Theta_{YX}$. It is
clear that the first condition is satisfied since the variables $(\hat\Theta, \hat S_Y, \hat L_Y)$ are optimal with respect to
(5.3). To prove that the second condition is met, it suffices to show that:

AVwi^{/T)] < \n - En] (5.9) 

- ^5,7['Pex AII/?s*(A)] - Ah*AC tjJ 


We first note: 


^5,'y\PmMAh*AVwij^i^)] < A„-I-2<h5^.y[AlIi2s*(A)]-I- 

^ I An _ (/3 -|- l)An 


Appealing to the Fisher information Assumption 1 in (2.5), we obtain: 


< An - $5,.,[AlIi?s(A)] - ^e,Mh*ACTj - $5,7 [All^n] 

< An - $5,..y[PH^AlIi?s*(A)] - $5,^[T’H^AlIrAlC'T^] 

— $<5,7 ['Ph^ Alikin] 


Here the last inequality holds since $ 5 , 7 [T’h^(.)] < $ 5 , 7 (. 


□ 


Proof of Theorem 1 : Letting fh = max{<5, 1 , 7 }, we note that C in Proposition 5.3 can be 
bounded as follows: C < 2'i/:deg(S'y)m. Further, one can check that the constants in Propo¬ 
sition 5.3 are related to the constants in Theorem 2.1 as follows: < mCaYx^ 

sequently Chmp h deg(Sy)C'samp- Under the assumptions of Theorem 2.1, we can appeal 
to Corollary |5.2| Proposition |5.3[ and Proposition |5.7| to conclude that with probability greater 

n\l 


than 1 — 2 exp-j — [j the variables (0,5y,Ly) are structurally correct estimates of 

(0*,5y,Ly); that is: sign(SY) = sign(SY),rank(LY) = rank(LY),rank(0Yx) = rank(0Yx) and 
the estimates {Q,Sy,Ly) are “close” to population quantities in appropriate norms: $ 5 ,..y[.S'y — 
Sy, Ly — Ly, Qyx ~ ®yxi “ ®x] — ^CiA^. 
5.2 Choices of Parameters $\gamma, \delta$ in the estimator (1.6)


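In this section, we give conditions on the population Fisher information $\mathbb{I}^\star$ such that Assumptions
1 and 2 in (2.5) and (2.6) are satisfied by a non-empty set of values of $(\delta, \gamma)$. In the subsequent
discussion in this section, we employ the following notation to denote restrictions of a subspace
$\mathbb{H} = H_1 \times H_2 \times H_3 \times H_4 \subset \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$ (here $H_1, H_2, H_3, H_4$ are subspaces in $\mathbb{S}^p$, $\mathbb{S}^p$, $\mathbb{R}^{p \times q}$, $\mathbb{S}^q$,
respectively) to its individual components. The restriction to the first component of $\mathbb{H}$ is given by
$\mathbb{H}[1] = H_1 \times \{0\} \times \{0\} \times \{0\} \subset \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$. The restrictions $\mathbb{H}[2], \mathbb{H}[3], \mathbb{H}[4]$ to the other
components of $\mathbb{H}$ are defined in an analogous manner.

As our first quantity, we consider the minimum gain of $\mathbb{I}^\star$ restricted to each of the tangent
spaces $\Omega(S_Y^\star)$, $T_Y$, and $T_{YX}$ separately: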


wy, wyx) = min 


ojyx) 


mm 

MgM[{\ 




‘ 1,1 • 


(5.10) 


i=l,2,3,4 ||M||.j..=l 


Here the set $U(\omega_Y, \omega_{YX})$ is defined in (2.3). Recall that this set denotes the distortions around the
population tangent spaces $(T(L_Y^\star), T(\Theta_{YX}^\star))$. Notice also that there is no appearance of $\delta, \gamma$ in the norm
$\Phi_{1,1}$. The quantity $\eta_1(\mathbb{I}^\star; \omega_Y, \omega_{YX})$ being large ensures that $\mathbb{I}^\star$ is well-conditioned when restricted
to each of the tangent spaces $\Omega(S_Y^\star)$, $T_Y$, $T_{YX}$ separately. The second quantity we consider is the
maximal inner-product between elements in each of the tangent spaces $\Omega(S_Y^\star)$, $T_Y$, $T_{YX}$ and those
in their respective orthogonal complements (again, in the metric induced by $\mathbb{I}^\star$):


? 72 (H*;a;y,a;yx) = max || 7 ^^(H[i])iI*^.A(H[i])(Ar )||2 

n&u(ujY, ^^Yx) ^ 

i=l,2,3,4 ||M||2=1 


(5.11) 


One additional aspect of Assumptions 1 and 2 (in (2.5) and (2.6)) that is not addressed via
the quantities $\eta_1(\mathbb{I}^\star; \omega_Y, \omega_{YX})$, $\eta_2(\mathbb{I}^\star; \omega_Y, \omega_{YX})$ is the gain of the population Fisher information $\mathbb{I}^\star$
restricted to $\Omega_Y \oplus T_Y$. Controlling this gain ensures that the tangent spaces $\Omega_Y$ and $T_Y$ have a
transverse intersection in the metric induced by $\mathbb{I}^\star$; as discussed in previous work by Chandrasekaran et
al. (2012), such a property is critical to ensure the accurate estimation of the latent-variable
graphical model specifying the conditional distribution of $Y|f(X)$. Following the approach adopted in
that work, we control the gain of $\mathbb{I}^\star$ restricted to $\Omega_Y \oplus T_Y$ via conditions involving three quantities.

The first quantity $\deg(S_Y^\star)$ makes an appearance in Theorem 2.1; it is the maximum number of
nonzeros per row/column of $S_Y^\star$, and it denotes the degree of the graphical model structure
underlying the conditional distribution of $Y|f(X)$. The degree of the sparse component $S_Y^\star$ being small
ensures that the graphical model underlying $Y|f(X)$ is indeed a sparsely connected structure.
Bounds on the degree of a population graphical model play an important role in the literature in
results on consistent graphical model selection (Meinshausen & Bühlmann, 2006; Ravikumar et
al., 2008). The second quantity is an incoherence parameter, which played an important role in
the literature on low-rank matrix completion (Candès & Recht, 2009). Specifically, for a matrix
$N \in \mathbb{R}^{p_1 \times p_2}$, the incoherence of the row-space / column-space of $N$ is given by:

$$\mathrm{inc}(N) = \max_{1 \leq i \leq p_1,\ 1 \leq j \leq p_2} \max \left\{ \|P_{\text{column-space}(N)}(e_i)\|_{\ell_2},\ \|P_{\text{row-space}(N)}(e_j)\|_{\ell_2} \right\} \qquad (5.12)$$


where $P$ denotes the projection operation and $e_i$ denotes the $i$'th standard basis vector.
The incoherence parameter of the low-rank matrix $L_Y^\star$ being small ensures that the latent variables
affect most of the observed responses $Y$. As developed by Chandrasekaran et al. (2012), the
quantities $\deg(S_Y^\star)$ and $\mathrm{inc}(L_Y^\star)$ being small simultaneously ensures that the tangent spaces $\Omega_Y =
\Omega(S_Y^\star)$ and $T_Y = T(L_Y^\star)$ are sufficiently transverse in the standard Euclidean inner-product. To
further ensure that the minimum gain of $\mathbb{I}^\star$ restricted to $\Omega_Y \oplus T_Y$ is bounded below (i.e., to certify
transversality of $\Omega_Y$ and $T_Y$ in the metric induced by $\mathbb{I}^\star$), Chandrasekaran et al. (2012) introduce
the following quantity for W = x {0} x {0} x {0} C x x x 


r/ 3 (Oy, Ty; wy) = max • 


max 

p(Ty,T^)<u>y 


max max ||T^(w)I[*M ||2 


M&T{r 


Men* 

I|m||2=i 


(5.13) 


The reason for the statement of this definition in terms of the $\ell_\infty$ and spectral norms is that these
are the dual norms of the regularizers employed in (1.6) (recall the discussion in Section 2.2). As
shown by Chandrasekaran et al. (2012) and as described in the following proposition, suitably
controlling the quantities $\deg(S_Y^\star)$, $\mathrm{inc}(L_Y^\star)$, $\eta_3(\Omega_Y, T_Y; \omega_Y)$ leads to lower bounds on the minimum gain
of $\mathbb{I}^\star$ restricted to $\Omega_Y \oplus T_Y$, which enables the accurate estimation of the latent-variable graphical
model underlying $Y|f(X)$.


In the following proposition, we describe a set of conditions on the quantities $\eta_1(\mathbb{I}^\star; \omega_Y, \omega_{YX})$,
$\eta_2(\mathbb{I}^\star; \omega_Y, \omega_{YX})$, $\eta_3(\Omega_Y, T_Y; \omega_Y)$, $\deg(S_Y^\star)$, and $\mathrm{inc}(L_Y^\star)$, which lead to Assumptions 1 and 2 in (2.5)
and (2.6) being satisfied for $(\delta, \gamma)$ inside a polyhedral set. We explicitly characterize
this set and show that it is non-empty. For notational convenience, we denote $\eta_1^\star = \eta_1(\mathbb{I}^\star; \omega_Y, \omega_{YX})$,
$\eta_2^\star = \eta_2(\mathbb{I}^\star; \omega_Y, \omega_{YX})$, and $\eta_3^\star = \eta_3(\Omega_Y, T_Y; \omega_Y)$.


Proposition 5.8. Fix a > 0,z^ G (0,1/3),cay > 0,cayx > 0. Let (3 = 
(i) vt > 2a, (ii) v *2 < min {a(l - 4 ^), 


III 


Suppose that 


and {iv) 


2 mc(L^)+a;-^ 

I—ljy 


deg(SY) < Then Assumptions 1 and 2 in (2.5) and (2.6) are satisfied for 


all (<5, 7 ) in the following non-empty polyhedral set: 


V{a,n,uiy,ujyx) = { ((^, 7 ) 


[2 inc(L^) + cay] , ^ ^ ^ 


4(1 — cay) 


max 1 1,772 deg(SY)(5- 



< 7 © 


deg(S^ 
min{5, 1 } a 

ri 2 ^ 


Conditions analogous to (iii), (iv) appear in previous work on latent-variable graphical model
selection (Chandrasekaran et al., 2012), and in our context they are useful for ensuring structurally
correct estimates of the latent-variable graphical model corresponding to the conditional
distribution of $Y|f(X)$. Conditions (i), (ii) are relevant for simultaneously obtaining structurally correct
estimates of the smallest dimension reduction $f(X)$ and of the latent-variable graphical model
specifying $Y|f(X)$ via the convex program (1.6). See Definition 2.1 for more details.


Proof. The proof of this proposition relies on two quantities. Given a matrix $M$, we defined
the first quantity $\mu(\Omega(M))$ in (5.4) with respect to the tangent space $\Omega(M)$. Additionally we define
the quantity $\xi(T(M))$ with respect to the tangent space $T(M)$:

$$\xi(T(M)) = \max_{N \in T(M),\ \|N\|_2 = 1} \|N\|_{\ell_\infty}$$

For extensive discussion regarding the properties of these quantities, we refer the reader to
(Chandrasekaran et al., 2012). Here, we highlight a few important facts. In particular, one can check that
$\xi(T(M)) \in [\mathrm{inc}(M), 2\,\mathrm{inc}(M)]$ and $\mu(\Omega(M)) \leq \deg(M)$ (recall that $\mathrm{inc}(M)$ measures the
incoherence of $M$ and $\deg(M)$ denotes the maximum number of nonzero elements in any column or row of
$M$). Furthermore, given two linear subspaces $T_1$ and $T_2$, the quantity $\rho(T_1, T_2)$ that measures the
distortion of tangent spaces (see the definition in (2.3)) allows us to bound the variation in $\xi(T_2)$
as follows (Chandrasekaran et al., 2012):

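Returning to the proof, we show that for any $(\delta, \gamma)$ inside the polyhedral set $\mathcal{V}(\alpha, \nu, \omega_Y, \omega_{YX})$,
Fisher information Assumptions 1 and 2 in (2.5) and (2.6) are satisfied. First, using conditions (iii) and (iv),
one can check that the polyhedral set $\mathcal{V}(\alpha, \nu, \omega_Y, \omega_{YX})$ is non-empty. Now, let $\mathbb{H} = \Omega_Y \times T_Y \times
T_{YX} \times \mathbb{S}^q$ be any subspace inside $U(\omega_Y, \omega_{YX})$ and the tradeoff parameters be chosen so that
$(\delta, \gamma) \in \mathcal{V}(\alpha, \nu, \omega_Y, \omega_{YX})$. Further let $\mathbb{Z} = \mathbb{S}^p \times \mathbb{S}^p \times \mathbb{R}^{p \times q} \times \mathbb{S}^q$, and let $(S_Y, L_Y, \Theta_{YX}, \Theta_X) \in \mathbb{H}$
with $\|S_Y\|_{\ell_\infty} \leq \delta$, $\|L_Y\|_2 \leq 1$, $\|\Theta_{YX}\|_2 \leq \gamma$, $\|\Theta_X\|_2 \leq 1$. Suppose equality holds in at least one
of these inequalities so that $\Phi_{\delta,\gamma}(S_Y, L_Y, \Theta_{YX}, \Theta_X) = 1$. Then, at least one of the following
cases is active: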

1. Suppose $\|S_Y\|_{\ell_\infty} = \delta$. Then using conditions (i)-(iii) of Proposition 5.8, we have:


Ly, @yx, 0x )]| ki,i > 


> - 


^[||iPH[l]^^lM(5y,0,0,0)b,,, 
Ty, 0, 0)||£^ 

I|T’^(H[1])I*-4(0, 0,0yx, 0x)lkoc 

l|T’H[i]-T^I*-T(S'y, 0,0,0) 11$^ 1 
I|T’^(Z[1])I*‘^(0, Ly, 0,0)11^^ 
ll^^(Z[l])I’'‘^(0, 0, 0yx, 0x)||£o, 
vU{Ty) 2r/5max{7,l} 


> 2a- 

> 2a- 


6 6 

(2 inc(LY) + (^Y)ri 3 2max{7,1}?72 

(1 — uy)S 6 


4a: 2a 8a 


2. Suppose $\|L_Y\|_2 = 1$. Then using conditions (i)-(iii) of Proposition 5.8, we have:

l|T’H[2]-4^I*'4(>S'Y,Ty,0yx,0x)||#i,i > ||T’h[ 2]'4^I*-^(0> Ty, 0, 0) 

“ l|T’^(H[2])I*"4(5'y,0,0,0)||2 
- l|T’^(e[2])I*-^(0, 0, 0yx, 0x)||2 
I|T’h[2]-4^I*-^(0, Ly, 0,0)||$i_;^ 
2 ||T’^(z[2])I*‘^(5'y, 0 ,0,0) II2 
2||T’^(z[ 2])I*'4(0, 0, 0yx, 0x)||2 
2a — 2r/3^(f2y)(5 — maxjy, 1} 

2a — 2r/3deg(SY)(5 — 4r/2 maxjy, 1} 
^ 4a 4a „ 8a 

^“■ 7 - 7 - Z 


> 


> 

> 

> 
3. Suppose $\|\Theta_{YX}\|_2 = \gamma$. Then using conditions (i) and (ii) of Proposition 5.8, we have:

1 


7 


Ty, 0yx, 0x)]||<I>i,i > - 0) 0)] j 


7 L 


- l|T’^(H[3])I*'^(*S'y,Ty,O,0yx)]||2 

I|T’h[3]“^^I*'^(0) 0 i 0yx, 0)] 


1 

> 


7 L 

2||T’^(Z[3])I*-^(‘S'y, Ty, 0, 0yx)] II 2 

2Ty2/x(f7y)(5 ^ 4 t/2 
7 7 


> 2a- 


2r72deg(SY)(5 . 

> 2a - ^ - 4r/? 

7 

a a 8a 

> 2a---->2a-j 


4. Suppose $\|\Theta_X\|_2 = 1$. Then using conditions (i) and (ii) of Proposition 5.8, we have:

l|T’H[4]‘T^I*-4(5y, Ly, 0yx, 0Js:)]||-I>i,i > ||Ph[ 4]-4^I*-4(0, 0, 0, 0x)||4>i,i 

- l|T’^(H[4])I*-T(5y,Ly, 0yx,O)||2 

> 2a — 72A^(r2y)5 — 2 r ]2 max{ 7 ,1} 

> 2a - 72deg(SY)5 - 2 t/27 

4a 2a 8a 

> 2a---->2a-- 


From these results, we conclude that Ly, 0yx, 0x)] > 2a — Further, we 

can bound the quantity ^(BI, 11-11$^.^) in (2.1) as follows 


X(EI, ll.bs,.,) > 2a - ^ > a 


(5.14) 


Using a similar decoupling technique, one can show: 
‘^<5,7 


T’h^ [7l'^II*7l(5y, Ly , 0yx, 0x)] < 72 + ^ < «(1 “ ^ ^ ^ 


Using this bound and the bound on x(IHI, ||.||$^.^), we control the quantity \\-Us,-y) in P^ : 


IMks,.,) < 


2a - ^ 


< 1 - 


1 + /3 


(5.15) 


Since the bounds (5.14) and (5.15) are valid for all $\mathbb{H} \in U(\omega_Y, \omega_{YX})$, Fisher information
Assumptions 1 and 2 (in (2.5) and (2.6)) are satisfied for $(\delta, \gamma)$ inside the polyhedral set $\mathcal{V}(\alpha, \nu, \omega_Y, \omega_{YX})$.

□
5.3 High-Dimensional Consistency of the Estimator (1.7)


In this section, we discuss the consistency properties of the estimator (1.7) in a high-dimensional
scaling regime. Specifically, suppose we observe samples of a collection
of jointly Gaussian responses and covariates $(Y, X)$ with joint population precision matrix

$$\Theta^\star = \begin{pmatrix} S_Y^\star - L_Y^\star & \Theta_{YX}^\star \\ {\Theta_{YX}^\star}' & \Theta_X^\star \end{pmatrix},$$

where $S_Y^\star$ is sparse, $L_Y^\star$ is low-rank, and $\Theta_{YX}^\star$ is column-sparse. Supplying
these observations into the program (1.7) and obtaining estimates $(\hat\Theta, \hat S_Y, \hat L_Y) \in \mathbb{S}^{p+q} \times \mathbb{S}^p \times \mathbb{S}^p$,
we prove in Theorem 5.9 that (under certain conditions on $\Theta^\star$ and with high probability) (a) the
column support of $\hat\Theta_{YX}$ is equal to the column support of $\Theta_{YX}^\star$, (b) $\mathrm{rank}(\hat L_Y) = \mathrm{rank}(L_Y^\star)$, and (c)
$\mathrm{sign}(\hat S_Y) = \mathrm{sign}(S_Y^\star)$. Thus, the subset of covariates that are sufficient for predicting the responses
and the latent-variable graphical model specifying the conditional distribution of the responses
given the covariates are both correctly identified.
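
The three structural-correctness criteria (a)-(c) can be checked mechanically once an estimate and a reference are available. The sketch below is illustrative only: the function names, the tolerance `tol`, and the toy matrices are assumptions rather than part of the paper, and the rank is computed by Gaussian elimination with a threshold, a simple stand-in for a singular-value-based rank.

```python
def sign_pattern(s, tol=1e-10):
    """Entrywise sign pattern in {-1, 0, +1}; entries within tol of zero count as zero."""
    return [[0 if abs(x) <= tol else (1 if x > 0 else -1) for x in row] for row in s]

def column_support(theta, tol=1e-10):
    """Indices of columns that contain at least one entry larger than tol in magnitude."""
    return {j for j in range(len(theta[0])) if any(abs(row[j]) > tol for row in theta)}

def rank(matrix, tol=1e-10):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    rows = [list(map(float, r)) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(rows[i][c]))
        if abs(rows[pivot][c]) <= tol:
            continue                      # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r

def structurally_correct(estimate, truth, tol=1e-10):
    """Check criteria (a)-(c) for triples (S_Y, L_Y, Theta_YX)."""
    s_hat, l_hat, t_hat = estimate
    s_star, l_star, t_star = truth
    return (column_support(t_hat, tol) == column_support(t_star, tol)
            and rank(l_hat, tol) == rank(l_star, tol)
            and sign_pattern(s_hat, tol) == sign_pattern(s_star, tol))

# Toy example: a rank-one L_Y and a Theta_YX with one nonzero column.
truth = ([[1.0, -0.5], [-0.5, 1.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.3, 0.0, 0.0], [0.2, 0.0, 0.0]])
good = ([[0.9, -0.4], [-0.4, 1.1]], [[0.8, 0.8], [0.8, 0.8]], [[0.25, 0.0, 0.0], [0.15, 0.0, 0.0]])
bad = ([[0.9, -0.4], [-0.4, 1.1]], [[0.8, 0.8], [0.8, 0.8]], [[0.25, 0.0, 0.1], [0.15, 0.0, 0.0]])

good_ok = structurally_correct(good, truth)
bad_ok = structurally_correct(bad, truth)
```

In this toy example `bad` differs from `good` only by one spurious nonzero column in the covariate block, which is exactly the kind of error criterion (a) rules out.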


Proceeding in a similar manner as in Section 2.2, we prove that the estimator (1.7) is consistent
under assumptions on the conditioning of the population Fisher information $\mathbb{I}^\star$. These assumptions
are stated in terms of tangent spaces of the algebraic variety of column-sparse matrices. Letting
$M \in \mathbb{R}^{p \times q}$ be a matrix with $k$ nonzero columns, the tangent space at $M$ with respect to the variety
of $p \times q$ matrices with at most $k$ nonzero columns is given by:

$$F(M) = \{ J \in \mathbb{R}^{p \times q} \mid \mathrm{columnsupport}(J) \subseteq \mathrm{columnsupport}(M) \}.$$

Here ‘columnsupport’ denotes the indices of the nonzero columns. As in Section 2.2 we control
the conditioning of $\mathbb{I}(\Theta^\star)$ for all subspaces $\mathbb{H}'$ in the following set:

$$U(\omega_Y) = \left\{ \Omega(S_Y^\star) \times T_Y \times F(\Theta_{YX}^\star) \times \mathbb{S}^q \;\middle|\; \rho(T_Y, T(L_Y^\star)) \leq \omega_Y \right\}. \qquad (5.16)$$


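We control the quantities $\chi(\mathbb{H}', \Phi_{\delta,\gamma})$ and $\varphi(\mathbb{H}', \Phi_{\delta,\gamma})$ (defined in (2.1) and (2.2)) for all $\mathbb{H}' \in U(\omega_Y)$
and for $\Phi_{\delta,\gamma}(S_Y, L_Y, \Theta_{YX}, \Theta_X)$ defined as:

$$\Phi_{\delta,\gamma}(S_Y, L_Y, \Theta_{YX}, \Theta_X) = \max \left\{ \frac{\|S_Y\|_{\ell_\infty}}{\delta},\ \|L_Y\|_2,\ \frac{\|\Theta_{YX}\|_{2,\infty}}{\gamma},\ \|\Theta_X\|_2 \right\} \qquad (5.17)$$

As in Section 2.2, the norm $\Phi_{\delta,\gamma}$ is a slight variant of the dual norm of the regularizer
$\delta\|S_Y\|_{\ell_1} + \mathrm{trace}(L_Y) + \gamma\|\Theta_{YX}\|_{1,2}$ in (1.7).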
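
To make the norm in (5.17) concrete, here is a minimal pure-Python sketch that evaluates it on small matrices. It assumes that $\|\cdot\|_{2,\infty}$ denotes the largest column $\ell_2$ norm (the dual of a sum of column $\ell_2$ norms); the function names and the example inputs are hypothetical. The spectral norm is approximated by power iteration, which is only a stand-in for a proper SVD and can fail if the start vector happens to be orthogonal to the top singular vector.

```python
import math

def spectral_norm(a, iters=200):
    """Approximate largest singular value by power iteration on A^T A."""
    n_cols = len(a[0])
    v = [float(j + 1) for j in range(n_cols)]        # deterministic, non-uniform start
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        av = [sum(row[j] * v[j] for j in range(n_cols)) for row in a]
        atav = [sum(a[i][j] * av[i] for i in range(len(a))) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in atav]
    av = [sum(row[j] * v[j] for j in range(n_cols)) for row in a]
    return math.sqrt(sum(x * x for x in av))

def max_abs_entry(s):
    """Entrywise l_infinity norm."""
    return max(abs(x) for row in s for x in row)

def max_column_norm(theta):
    """Largest l_2 norm over columns (the assumed meaning of the (2, infinity) norm)."""
    return max(math.sqrt(sum(row[j] ** 2 for row in theta)) for j in range(len(theta[0])))

def phi(s_y, l_y, theta_yx, theta_x, delta, gamma):
    """Evaluate the norm of (5.17) under the assumptions stated above."""
    return max(max_abs_entry(s_y) / delta,
               spectral_norm(l_y),
               max_column_norm(theta_yx) / gamma,
               spectral_norm(theta_x))

# Hypothetical inputs: the four terms are 0.2, 3, 0.5 and 1, so the L_Y term dominates.
value = phi(s_y=[[0.2, 0.0], [0.0, -0.4]],
            l_y=[[3.0, 0.0], [0.0, 1.0]],
            theta_yx=[[3.0, 0.0], [4.0, 0.0]],
            theta_x=[[1.0]],
            delta=2.0, gamma=10.0)
```

Dividing the sparse and column-sparse components by $\delta$ and $\gamma$ puts all four blocks on a common scale, which is why a single threshold on $\Phi_{\delta,\gamma}$ controls every component at once.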

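In summary, given $(\delta, \gamma, \omega_Y) \in \mathbb{R}_+ \times \mathbb{R}_+ \times (0,1)$ we assume that the population Fisher information
$\mathbb{I}^\star$ satisfies the following conditions:

Assumption 3: $\inf_{\mathbb{H}' \in U(\omega_Y)} \chi(\mathbb{H}', \Phi_{\delta,\gamma}) \geq \alpha$, for some $\alpha > 0$.

Assumption 4: $\sup_{\mathbb{H}' \in U(\omega_Y)} \varphi(\mathbb{H}', \Phi_{\delta,\gamma}) \leq 1 - \nu$, for some $\nu \in (0, 1/3)$.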


As with the notation preceding the statement of Theorem 2.1, let ry denote the minimum nonzero 
entry in magnitude of Sy, let cry denote the minimum nonzero singular value of Ly, and let 
deg(S'y) denote the maximal number of nonzeros per row/column of Sy. Further, let Cyx denote 
the minimum £2 norm over nonzero columns of 0yY and let k be the number of nonzero columns 
of 0^Y- 
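
The population quantities just introduced are simple functions of the matrices involved, apart from the singular value, which needs an SVD and is omitted here. A minimal sketch follows, with hypothetical function names, a hypothetical tolerance `tol` for deciding which entries count as nonzero, and made-up example matrices:

```python
import math

def degree(s, tol=1e-10):
    """Maximum number of nonzeros in any row or column."""
    row_counts = [sum(1 for x in row if abs(x) > tol) for row in s]
    col_counts = [sum(1 for row in s if abs(row[j]) > tol) for j in range(len(s[0]))]
    return max(row_counts + col_counts)

def min_nonzero_magnitude(s, tol=1e-10):
    """Smallest magnitude among nonzero entries (raises ValueError for a zero matrix)."""
    return min(abs(x) for row in s for x in row if abs(x) > tol)

def nonzero_column_norms(theta, tol=1e-10):
    """l_2 norms of the nonzero columns, in column order."""
    norms = [math.sqrt(sum(row[j] ** 2 for row in theta)) for j in range(len(theta[0]))]
    return [x for x in norms if x > tol]

# Hypothetical sparse component and column-sparse covariate block.
s_star = [[2.0, -0.5, 0.0],
          [-0.5, 2.0, 0.1],
          [0.0, 0.1, 2.0]]
theta_yx_star = [[3.0, 0.0, 0.6],
                 [4.0, 0.0, 0.8]]

deg = degree(s_star)                                  # counts diagonal entries as well
smallest_entry = min_nonzero_magnitude(s_star)
column_norms = nonzero_column_norms(theta_yx_star)
num_nonzero_columns = len(column_norms)
smallest_column_norm = min(column_norms)
```

These are the quantities that the signal-strength and sample-size conditions of the consistency theorem are phrased in terms of, so computing them on a candidate model gives a quick sense of how demanding those conditions are.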


®The variety of column-sparse matrices is locally flat around $\Theta_{YX}^\star$, so that the tangent spaces at all points in a
neighborhood of $\Theta_{YX}^\star$ are all equal to $F(\Theta_{YX}^\star)$.
Theorem 5.9. Suppose we are given $n$ i.i.d. observations of a collection of
jointly Gaussian covariates/responses with population precision matrix $\Theta^\star \in \mathbb{S}^{p+q}$. Fix $\alpha > 0$, $\nu \in
(0, 1/3)$, $\omega_Y \in (0,1)$. Suppose the trade-off parameters $\delta$ and $\gamma$ are chosen such that the population
Fisher information $\mathbb{I}(\Theta^\star)$ satisfies Assumptions 3 and 4.


A 3-iy 


Let m = max{|, 1, m = max{(5,1,7}, (3 = ^ 


, and Ip = 11(0*) ^||2- Further, letCi = ^ + 


h, C 2 = f (4+1)’ = Cl V'^maxl 12/3+1, ^^+1}, Csamp = max{ 48^V’^C'f, 8V’C'2, 


'll;' 

and A 


a ''33 

_ _ 1 _ 

“PP®'" mm^ max{deg(Sy),K}Csamp 


C 272 I -’-I’ ^samp — *^^“^1487/3’ 

. Suppose that the following conditions hold: 


l.n> that zs 


‘^upper 


n 


> 


/34 


771*^171^ max{deg(Sy)^, {p + q) 


2. Xn G 


46O8'0^/3^m^(p+g) ^ 


upper 



3 . ty > 27 iC'iA„; that is Ty 
4- cry > mCfjXn; that is ay 



5. Qyx > 276*1 A„; that is Cyx > 



Then with probability greater than 1 — 2exp{ — 4603^32^2^2 }; the optimal solution (0, Sy, Ly) of (1.7) 
with the observations satisfies the following properties: 


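1. $\mathrm{sign}(\hat S_Y) = \mathrm{sign}(S_Y^\star)$, $\mathrm{rank}(\hat L_Y) = \mathrm{rank}(L_Y^\star)$, and $\mathrm{columnsupport}(\hat\Theta_{YX}) = \mathrm{columnsupport}(\Theta_{YX}^\star)$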


2. <I)5^..),(5y — Sy,Ly — Ly, Qyx — 0yx’ “ 0^) — CiA„; that is ||5y — S'y ^ 
||Ly - L ^||2 < ^rnJ^, || 0 yx - 0yxll2,oo < |7"+ 


A. 



+ if 


I|0x - eill2 < +/+ 


p±q 

n 


The strategy for the proof of Theorem 5.9 is analogous to that of Theorem 2.1.